# Q-learning
This notebook applies the Taxi reinforcement learning environment (`Taxi-v3`) as a metaphor for employee training programs. The goal is to maximize the profit from increased employee learning while minimizing the cost of training.

---

# Introduction to Reinforcement Learning  

Reinforcement learning is learning what to do, that is, how to map situations to actions so as to maximize a numerical reward signal. The learner is not told which actions to take, but must instead discover which actions yield the most reward by trying them. This kind of problem is usually formalized as a Markov decision process (MDP). The learner and decision maker is called the **agent**. The thing that the agent interacts with is called the **environment**, which can be thought of as everything outside the agent.

The agent and environment interact in a loop: the agent observes some portion of the environment and takes an action, after which the environment responds and presents a new situation to the agent. More specifically, the agent and environment interact at each of a sequence of discrete time steps $t = 0, 1, \dots, T$, where $T$ is the final time step, at which the agent reaches a **terminal state**. At each time step $t$, the agent receives some representation of the environment's **state**, $S_t \in \mathcal{S}$, and on that basis selects an **action**, $A_t \in \mathcal{A}$. Here, $\mathcal{S}$ is the set of all possible states and $\mathcal{A}$ is the set of all valid actions. One time step later, and in part as a consequence of action $A_t$, the agent receives a numerical **reward**, $R_{t+1} \in \mathbb{R}$, and finds itself in a new state, $S_{t+1}\in \mathcal{S}$. The MDP and agent together give rise to a sequence, or **trajectory**, typically denoted by $\tau$:

$$
\tau = S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots 
$$
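
The interaction loop above can be sketched in a few lines of plain Python. `CorridorEnv`, `rollout`, and the reward values used here are hypothetical names and numbers chosen only for illustration; they are not part of `gym` or of the Taxi environment used later in this notebook.

```python
import random


class CorridorEnv:
    """Hypothetical toy MDP: positions 0..length-1, the goal is the last position."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; the walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 10 if done else -1  # assumed rewards, chosen for illustration
        return self.state, reward, done


def rollout(env, policy, max_steps=100):
    """Run one episode and record (S_t, A_t, R_{t+1}) at every time step."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                       # agent picks A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


always_right = rollout(CorridorEnv(), lambda s: 1)
random.seed(0)
random_walk = rollout(CorridorEnv(), lambda s: random.choice([0, 1]))
print(always_right)
```

Each tuple in the returned list corresponds to one $S_t, A_t, R_{t+1}$ triple of the trajectory $\tau$.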


The goal of the agent is to maximize the cumulative reward it collects along a trajectory, starting from state $S_t$. This quantity is called the **return** and is given by:

$$
G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots,
$$

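where $\gamma \in [0, 1]$ is the **discount rate**. The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k-1}$ times what it would be worth if received immediately. The return can also be written recursively as: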

$$
G_{t} = R_{t+1} + \gamma G_{t+1}. 
$$
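
As a quick numerical check, the sketch below computes the return of a short, made-up reward sequence in two ways: by summing the discounted terms directly, and by applying the recursion $G_t = R_{t+1} + \gamma G_{t+1}$ backwards from the end of the episode. The function names and the reward values are illustrative assumptions, and $\gamma = 0.5$ is used only because it keeps the arithmetic exact.

```python
def discounted_return_direct(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """Apply G_t = R_{t+1} + gamma * G_{t+1}, starting from the final reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [-1, -1, -1, 20]  # hypothetical rewards R_{t+1}, ..., R_{t+4}
gamma = 0.5

g_direct = discounted_return_direct(rewards, gamma)
g_recursive = discounted_return_recursive(rewards, gamma)
print(g_direct, g_recursive)  # both give 0.75
```

With $\gamma = 0.5$ the final reward of 20 contributes only $0.5^3 \cdot 20 = 2.5$ to the return, which shows how discounting shrinks distant rewards.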



## Policies and the Q-Function

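A **policy** $\pi$ is a mapping from states to probabilities of selecting each possible action. The **action-value function**, or **Q-function**, of a policy $\pi$ gives the expected return when starting in state $s$, taking action $a$, and following $\pi$ thereafter:

$$
q_{\pi}(s, a) = \mathbb{E}_{\pi}\Big[ G_t | S_t = s, A_t = a\Big] = \mathbb{E}_{\pi}\Big[ R_{t+1} + \gamma G_{t+1}| S_t = s, A_t = a\Big]
$$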
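
One concrete example of a policy is the $\epsilon$-greedy policy derived from a row of Q-values, which appears to be the action-selection scheme used by the training cell later in this notebook (`epsilon = 0.1`). The sketch below writes it out as explicit action probabilities. The function name is hypothetical, and the Q-values are rounded from a row printed further down in this notebook.

```python
def epsilon_greedy_probabilities(q_values, epsilon):
    """Return the probability of selecting each action under epsilon-greedy.

    With probability epsilon an action is drawn uniformly at random (which can
    also pick the greedy action); otherwise the greedy action is taken.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


# Q-values for one state, rounded from the q_table[44] row shown later on.
q_values = [-2.44, -2.43, -2.42, -2.44, -8.42, -7.30]
probs = epsilon_greedy_probabilities(q_values, epsilon=0.1)
print(probs)
```

The greedy action (index 2 here) receives probability $1 - \epsilon + \epsilon/6$, and every other action receives $\epsilon/6$.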

Let $Q(S_t, A_t)$ denote the agent's current estimate of the q-value of the state-action pair $(S_t, A_t)$. Through experience, the agent can learn how good this estimate is (just as we compare predicted labels to true labels in supervised learning), and it can then update $Q(S_t, A_t)$ after observing the reward and the next state. Q-learning uses the following update rule, where $S$ and $A$ are the current state and action, $R$ is the reward received, $S'$ is the next state, and $\alpha$ is the learning rate:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha \Big[R + \gamma \max_{a}Q(S', a) - Q(S, A) \Big]
$$
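
To make the update rule concrete, here is a single update worked through on a tiny hand-built table. The table contents, the transition, and the helper name `q_update` are made up for illustration; `alpha = 0.1` and `gamma = 0.6` match the hyperparameters used in the training cell of this notebook. The names "TD target" and "TD error" are common terminology for the bracketed quantities and are not used elsewhere in this notebook.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(S, A)."""
    td_target = reward + gamma * max(q_table[next_state])  # R + gamma * max_a Q(S', a)
    td_error = td_target - q_table[state][action]          # bracketed term in the rule
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Hypothetical table with 3 states and 2 actions, stored as nested lists.
q = [
    [0.0, 0.0],
    [2.0, 5.0],
    [0.0, 0.0],
]

# Transition: in state 0, take action 1, receive reward -1, land in state 1.
new_value = q_update(q, state=0, action=1, reward=-1, next_state=1, alpha=0.1, gamma=0.6)
print(new_value)  # target = -1 + 0.6 * 5 = 2, so Q(0, 1) moves from 0 to about 0.2
```

Only the entry for the visited pair $(S, A)$ changes; the other entries are left untouched until their own state-action pairs are visited.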



---

In [2]:
# Import the gym environment 
import gym

# Instantiate the taxi environment 
env = gym.make("Taxi-v3")

# Reset the environment 
env.reset()

# Show the current frame of the environment 
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



In [3]:
import numpy as np


# Initialize Q-values as a Q-table of 0's 
q_table = np.zeros([env.observation_space.n, env.action_space.n])

print(f"The shape of the Q-table is: {q_table.shape} \n")

The shape of the Q-table is: (500, 6) 
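
The 500 rows come from 25 taxi positions (5 rows by 5 columns), 5 passenger locations (four landmarks plus "in the taxi"), and 4 destinations. The sketch below is an assumed reconstruction of how the four components are packed into a single integer. It is not taken from the `gym` source, but it reproduces the `State:` values 44 and 48 printed later in this notebook. The function names are hypothetical.

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four state components into one integer in [0, 500)."""
    state = taxi_row             # 5 possible rows
    state = state * 5 + taxi_col           # 5 possible columns
    state = state * 5 + passenger_index    # 5 possible passenger locations
    state = state * 4 + destination_index  # 4 possible destinations
    return state


def decode_taxi_state(state):
    """Invert encode_taxi_state."""
    state, destination_index = divmod(state, 4)
    state, passenger_index = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode_taxi_state(0, 2, 1, 0))  # 44
print(encode_taxi_state(0, 2, 2, 0))  # 48
print(decode_taxi_state(410))         # (4, 0, 2, 2)
```

The largest index, `encode_taxi_state(4, 4, 4, 3)`, is 499, which agrees with the 500 rows of the Q-table.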



In [4]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []
frames = []
for i in range(1, 10_001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        # Q-table update Rule 
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        
        state = next_state
        epochs += 1
        
    
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")


 
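state = env.encode(0, 2, 1, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
print(q_table[state])
# NOTE: the frame rendered below is labelled (Dropoff), i.e. the last step actually taken,
# which suggests this assignment does not reach the wrapped environment;
# env.unwrapped.s = state may be needed instead.
env.s = state
# Greedy action for this state (np.argmax(state) on the scalar index is always 0)
print(np.argmax(q_table[state]))
env.render()

Episode: 10000
Training finished.

State: 44
[-2.43927814 -2.42957971 -2.41837065 -2.43537902 -8.41599481 -7.30235066]
2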
+---------+
|R: | : :[35m[34;1m[43mG[0m[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Wall time: 5.15 s


In [5]:
state = env.encode(0, 2, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

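# NOTE: the frame below is still labelled (Dropoff), so this assignment may not
# reach the wrapped environment; env.unwrapped.s = state may be needed instead.
env.s = state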
env.render()

State: 48
+---------+
|R: | : :[35m[34;1m[43mG[0m[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)


In [6]:
q_table[48]

array([-2.41832002, -2.41874629, -2.42279029, -2.42441559, -5.76406888,
       -8.67494605])

In [7]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 2
frames = []
for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        # Put each rendered frame into a dict for animation
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward,
        })
        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after training {episodes} employees:")
print(f"Average number of trainings each employee needed to participate in: {total_epochs / episodes}")
print(f"Average number of incorrect trainings per employee: {total_penalties / episodes}")

Results after training 2 employees:
Average number of trainings each employee needed to participate in: 10.0
Average number of incorrect trainings per employee: 0.0


In [9]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    """Replay the recorded frames, keeping a running total of the rewards."""
    total_revenue = 0
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        #print(frame['frame'].getvalue())
        print(f"Total number of actions to date: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Company revenue due to training for that action (in thousands of dollars): {frame['reward']}")
        total_revenue += frame['reward']
        print(f"To date company revenue due to training (in thousands of dollars): {total_revenue}")
        sleep(.1)
        

print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|[35m[34;1m[43mY[0m[0m[0m| : |B: |
+---------+
  (Dropoff)

Total number of actions to date: 20
State: 410
Action: 5
Company revenue due to training for that action (in thousands of dollars): 20
To date company revenue due to training (in thousands of dollars): 22
