# Q-learning
This notebook applies Q-learning to the Taxi (`Taxi-v3`) reinforcement learning environment, used here as a metaphor for employee training programs. The goal is to maximize the profit from increased employee learning while minimizing the cost of training. The environment is explained in the [environment introduction notebook](https://github.com/Madison-Bunting/INDE-577/blob/main/reinforcement%20learning/1%20-%20q-learning/1-%20Reinforcement%20Learning%20Environment%20Introduction.ipynb).

The environment comes with its states, actions, and rewards already defined, and instantiating it also places the agent (the taxi) inside it.

In [2]:
# Import the gym environment 
import gym

# Instantiate the taxi environment 
env = gym.make("Taxi-v3")

# Reset the environment 
env.reset()

# Show the current frame of the environment 
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



Create the Q-table to keep track of the agent's options and potential rewards.

In [3]:
import numpy as np

# Initialize Q-values as a Q-table of 0's 
q_table = np.zeros([env.observation_space.n, env.action_space.n])

print(f"The shape of the Q-table is: {q_table.shape} \n")

The shape of the Q-table is: (500, 6) 
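
The 500 rows follow from how Taxi-v3 packs four components into one state index: 5 taxi rows, 5 taxi columns, 5 passenger locations (R, G, Y, B, or inside the taxi), and 4 destinations. The sketch below reproduces that packing in plain Python. The helper names `encode_state` and `decode_state` are illustrative and are not part of gym; the environment's own `env.encode` is what the notebook uses.

```python
# Sketch of how Taxi-v3 packs four components into one state index.
# encode_state / decode_state are illustrative helpers, not gym functions.

N_ROWS, N_COLS = 5, 5      # taxi position on the 5x5 grid
N_PASSENGER_LOCS = 5       # R, G, Y, B, or inside the taxi
N_DESTINATIONS = 4         # R, G, Y, B


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Combine the four components into a single integer index."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_loc
    index = index * N_DESTINATIONS + destination
    return index


def decode_state(index):
    """Invert encode_state and return the four components."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_loc = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)                  # 500
print(encode_state(0, 2, 1, 0))  # 44
print(decode_state(475))         # (4, 3, 3, 3)
```

The 6 columns are the six actions: south, north, east, west, pickup, and dropoff.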



Next we train and validate the agent by running it through "episodes", defined as each time the environment resets. In this case, it would be the company restarting its training program from scratch and running on a batch of new employees. Note that this takes time.
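
Before running the full loop, it may help to see the Q-table update rule on its own. The sketch below isolates it as a small function, using the hyperparameters from the training cell (`alpha = 0.1`, `gamma = 0.6`). The function name `q_update` is hypothetical and only serves this illustration.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Return the updated Q-value for one (state, action) pair.

    old_value: current estimate Q(s, a)
    reward:    reward received for taking action a in state s
    next_max:  largest Q-value available in the next state
    """
    target = reward + gamma * next_max          # one-step lookahead estimate
    return (1 - alpha) * old_value + alpha * target


# Starting from an all-zero table, an ordinary move (reward -1)
# nudges the estimate a tenth of the way toward the target.
first = q_update(old_value=0.0, reward=-1, next_max=0.0)
second = q_update(old_value=first, reward=-1, next_max=0.0)
print(round(first, 4), round(second, 4))  # -0.1 -0.19
```

A small `alpha` means each experience only shifts the estimate slightly, which is one reason many episodes are needed before the values settle.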

In [25]:
%%time
"""Training the agent"""
import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []
frames = []
for i in range(1, 10_001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values
        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        # Q-table update Rule 
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1
       
        state = next_state
        epochs += 1
       
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")
print("Training finished.\n")

# Visualize the process
state = env.encode(0, 2, 1, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)
print("Q-table:", q_table[state])
env.s = state
print(np.argmax(q_table[state])) # Greedy action for this state
env.render()

Episode: 10000
Training finished.

State: 44
Q-table: [ -2.46538759  -2.44496635  -2.41837066  -2.44931071 -10.99548104
 -10.59684768]
2
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
Wall time: 15.7 s
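
To read the Q-table row printed above, pick the action with the largest value. The sketch below does this in plain Python with the six values copied from the output for state 44. The action names follow Taxi-v3's action indices, and the helper `greedy_action` is illustrative only.

```python
# Action indices used by Taxi-v3.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Q-values printed above for state 44 (copied from the notebook output).
q_row = [-2.46538759, -2.44496635, -2.41837066,
         -2.44931071, -10.99548104, -10.59684768]


def greedy_action(q_values):
    """Return the index of the largest Q-value (the first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


best = greedy_action(q_row)
print(best, ACTION_NAMES[best])  # 2 east
```

In state 44 the taxi is at row 0, column 2 and the passenger waits at G in the top-right corner, so moving east is the sensible first step. The two strongly negative values belong to pickup and dropoff, which are illegal in that position.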


Finally, we execute the learned policy. Compared with the untrained agent in the [environment notebook](https://github.com/Madison-Bunting/INDE-577/blob/main/reinforcement%20learning/1%20-%20q-learning/1-%20Reinforcement%20Learning%20Environment%20Introduction.ipynb), this agent almost never makes errors. In the context of a training program, this means the program gives each employee the fewest trainings needed to maximize their performance, and the company comes out with a net gain from the program.

In [30]:
"""Evaluate agent's performance after Q-learning"""
total_epochs, total_penalties = 0, 0
episodes = 10
frames = []
for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1
        # Put each rendered frame into dict for animation
        frames.append({'frame': env.render(mode='ansi'),
                       'state': state,
                       'action': action,
                       'reward': reward})
        epochs += 1
    total_penalties += penalties
    total_epochs += epochs

print(f"Results after training {episodes} employees:")
print(f"Average number of trainings each employee needed to participate in: {total_epochs / episodes}")
print(f"Average number of incorrect trainings per employee: {total_penalties / episodes}")

Results after training 10 employees:
Average number of trainings each employee needed to participate in: 12.5
Average number of incorrect trainings per employee: 0.0


In [31]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    total_revenue = 0  # Running sum of rewards across all recorded frames
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        #print(frame['frame'].getvalue())
        print(f"Total number of actions to date: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Company revenue due to training for that action (in thousands of dollars): {frame['reward']}")
        total_revenue += frame['reward']
        print(f"To date company revenue due to training (in thousands of dollars): {total_revenue}")
        sleep(.1)
        
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Total number of actions to date: 125
State: 475
Action: 5
Company revenue due to training for that action (in thousands of dollars): 20
To date company revenue due to training (in thousands of dollars): 85


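It appears that, on average, employees need between 10 and 15 trainings (12.5 in this run) to "level up" their skills. Each completed program brings in \$20,000, but every training step before the last costs \$1,000, so the company nets about \$8,500 in additional revenue per employee (\$85,000 across the 10 employees above).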
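
The totals printed above (125 actions and 85 thousand dollars over 10 employees) can be reproduced from Taxi-v3's reward scheme: -1 per ordinary move, -10 per illegal pickup or dropoff, and +20 for the final successful dropoff. The sketch below assumes every episode ends with a successful dropoff; the function names are illustrative only.

```python
# Taxi-v3 reward scheme (in the notebook's metaphor: thousands of dollars).
STEP_REWARD = -1       # every ordinary move
PENALTY_REWARD = -10   # illegal pickup or dropoff
DROPOFF_REWARD = 20    # successful delivery, which ends the episode


def episode_return(steps, penalties=0):
    """Total reward of one episode that ends with a successful dropoff."""
    ordinary_steps = steps - 1 - penalties  # exclude the final dropoff step
    return (DROPOFF_REWARD
            + PENALTY_REWARD * penalties
            + STEP_REWARD * ordinary_steps)


def batch_return(total_steps, episodes, total_penalties=0):
    """Sum of episode returns for a batch, given only the batch totals."""
    ordinary_steps = total_steps - episodes - total_penalties
    return (DROPOFF_REWARD * episodes
            + PENALTY_REWARD * total_penalties
            + STEP_REWARD * ordinary_steps)


total = batch_return(total_steps=125, episodes=10)
print(total)                      # 85
print(total / 10)                 # 8.5
print(episode_return(steps=13))   # 8
```

Under this scheme an employee who needs more trainings, or who is sent to an incorrect one, directly lowers the net figure.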

This analysis could be used by HR partners to justify their training program. In the future, the model could additionally incorporate:
- different types of trainings
- more specific employee starting states (via a pre-test)
- the impact of combined training regimens (for example, DEI training and technical skills training)
- comparisons for the cost vs outcomes (in dollar amounts) of different training providers