# Q-learning
This notebook applies the Taxi-v3 reinforcement learning environment as a metaphor for employee training programs. The goal is to maximize the profit from increased employee learning while minimizing the cost of training.

In [8]:
# Import the gym environment 
import gym

# Instantiate the taxi environment 
env = gym.make("Taxi-v3")

# Reset the environment 
env.reset()

# Show the current frame of the environment 
env.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
[taxi: row 1, column 1 (empty) · passenger: Y · destination: R]



---

## The Taxi-v3 environment (and translation to the context of employee training)

The grid has 4 designated locations, each labeled with a different letter. The goal is to pick a passenger up at one location and drop them off at another. In the case of employee training, each location represents a level of knowledge the employee has on the topic of the training. Pickup locations signify the amount of prior knowledge the employee enters the training with, and dropoff locations signify the "goal" level of knowledge the employer hopes the employee has by the end of the program.

* **Observations**: There are 500 discrete states, the product of 25 training opportunities (*i.e. taxi positions on the 5 × 5 grid*), 5 employee knowledge states, one of which is "in training" (*i.e. locations of the passenger, including the case when the passenger is in the taxi*), and 4 goal levels of knowledge (*i.e. destination locations*): 25 × 5 × 4 = 500.
   
* **Employee knowledge states** (*passenger locations*):
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
 - 4: in training (*i.e. taxi*)
  
* **Goal levels of employee knowledge** (*Destinations*):
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
   
* **Actions**: There are 6 discrete deterministic actions: 4 types of training (*i.e. taxi movement in each of the 4 directions*) plus 2 actions for employees entering and completing the training program:
 - 0: move south
 - 1: move north
 - 2: move east
 - 3: move west
 - 4: an employee received a note in their performance evaluation that they need to receive training (*i.e. pickup passenger*)
 - 5: the employee successfully completed their assigned training program (*i.e. drop off passenger*)
 
* **Rewards** (the environment reports them in thousands of dollars):
 - +20: each successfully trained employee (*i.e. passenger that is dropped off at the correct location*) increases company revenue by 20,000 dollars per year.
 - -1: each training the employee must attend (*i.e. time-step*) costs the company 1,000 dollars.
 - -10: each time the employee is incorrectly enrolled in or released from the training program (*i.e. illegal pick-up or drop-off action*) costs the company 10,000 dollars.

* **Rendering**:
 - blue: passenger
 - magenta: destination
 - yellow: empty taxi
 - green: full taxi
 - other letters (R, G, Y and B): locations for passengers and destinations
   
* **State representation**: each state is the tuple (taxi_row, taxi_col, passenger_location, destination), which the environment packs into a single integer between 0 and 499.
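
To make the state count concrete, the sketch below packs the 4-tuple into a single integer and unpacks it again. It is a minimal illustration that assumes a mixed-radix layout (row, then column, then passenger location, then destination); the helper names `encode_state` and `decode_state` are hypothetical and are not part of the gym API, which offers its own `env.encode` method.

```python
# Sizes taken from the bullets above: 5x5 grid, 5 passenger locations, 4 destinations.
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4


def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the 4-tuple into one integer (assumed mixed-radix layout)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_location
    index = index * N_DESTINATIONS + destination
    return index


def decode_state(index):
    """Invert encode_state: recover (row, col, passenger, destination)."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_location = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_location, destination


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)                  # 500
print(encode_state(0, 2, 2, 0))  # 48
print(decode_state(348))         # (3, 2, 2, 0)
```

Under this assumed layout, state 348 corresponds to the taxi at row 3, column 2, with the passenger waiting at Y (index 2) and destination R (index 0).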


---

In [15]:
import numpy as np

# Choose a random action
action = np.random.randint(6)

# Take a step
observation, reward, done, info = env.step(action)

# Show the current frame of the environment 
env.render()

print(f"\naction: {action}")

print(f"observation: {observation}")

print(f"company spending (in thousands of dollars): {reward}")

print(f"done: {done}")

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
[taxi: row 3, column 2 (empty) · passenger: Y · destination: R]

action: 5
observation: 348
company spending (in thousands of dollars): -10
done: False


---

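We can also manually set the state of the current ``env``: first compute the state index with the ``env.encode`` method, then assign it to the state attribute ``s`` of the underlying environment. Because ``gym.make`` returns the environment inside a wrapper, the assignment has to go through ``env.unwrapped``; assigning to ``env.s`` would only create a new attribute on the wrapper and leave the rendered state unchanged. The following code cell illustrates this option.

---

In [16]:
# Use the env.encode method 
state = env.encode(0, 2, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print(f"State: {state}")

# Set the current state of the underlying (unwrapped) environment
env.unwrapped.s = state

# Show the current frame of the environment 
env.render()

State: 48
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
[taxi: row 0, column 2 (empty) · passenger: Y · destination: R]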


In [29]:
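env.reset()

epochs = 0      # number of actions taken so far (time-steps)
penalties = 0   # number of illegal pick-up/drop-off actions
reward = 0
max_iter = 150  # stop the random agent after this many actions

frames = [] # for animation

done = False

# Take uniformly random actions until the episode ends or max_iter is reached
while not done and epochs < max_iter:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    # A reward of -10 marks an illegal pick-up or drop-off
    if reward == -10:
        penalties += 1

    # Store each rendered frame and its metadata in a dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })

    epochs += 1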
    
    
print(f"Total number of actions (giving trainings, adding employees to the training program or releasing employees from the program): {epochs} \n")
print(f"Number of times employees were given an incorrect number of trainings: {penalties} \n")

Total number of actions (giving trainings, adding employees to the training program or releasing employees from the program): 150 

Number of times employees were given an incorrect number of trainings: 39 



In [36]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    """Replay the stored frames one by one, printing a running reward total."""
    total_cost = 0  # cumulative reward; negative values are net spending
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Total number of actions to date: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Company spending for that action (in thousands of dollars): {frame['reward']}")
        total_cost += frame['reward']
        print(f"To date company spending on training (in thousands of dollars): {total_cost}")
        sleep(.1)


print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
[taxi: row 0, column 0 (empty) · passenger: B · destination: Y]

Total number of actions to date: 150
State: 14
Action: 1
Company spending for that action (in thousands of dollars): -1
To date company spending on training (in thousands of dollars): -501
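
The running total above can be cross-checked from the counts reported by the random run. The sketch below assumes the three reward values listed in the Rewards bullet (+20, -1, -10, in thousands of dollars) and that each action earns exactly one of them; `total_reward` is a hypothetical helper, not part of gym.

```python
# Reward values from the Rewards bullet, in thousands of dollars.
STEP_COST = -1        # ordinary time-step (one training attended)
PENALTY_COST = -10    # illegal pick-up or drop-off
SUCCESS_REWARD = 20   # passenger dropped off at the correct location


def total_reward(n_actions, n_penalties, n_successes=0):
    """Sum the rewards of a run in which every action earns exactly one reward."""
    ordinary_steps = n_actions - n_penalties - n_successes
    return (ordinary_steps * STEP_COST
            + n_penalties * PENALTY_COST
            + n_successes * SUCCESS_REWARD)


# 150 random actions, 39 of them penalized, no successful drop-off
print(total_reward(150, 39))  # -501
```

The result matches the final "to date" figure: 111 ordinary steps at -1 plus 39 penalties at -10 give -501, so the random agent spent 501,000 dollars without training a single employee.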


In [33]:
# Transition table for state 450: {action: [(probability, next_state, reward, done)]}
env.P[450]

{0: [(1.0, 450, -1, False)],
 1: [(1.0, 350, -1, False)],
 2: [(1.0, 450, -1, False)],
 3: [(1.0, 430, -1, False)],
 4: [(1.0, 450, -10, False)],
 5: [(1.0, 450, -10, False)]}
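
The dictionary above can be hard to read at a glance. The sketch below copies it into a plain Python dict and prints one readable line per action. It assumes each tuple is ordered as (probability, next state, reward, done) and reuses the action numbering from the Actions list; `describe` and `actions_that_move` are hypothetical helpers written for this illustration.

```python
# Action labels follow the Actions list earlier in this notebook.
ACTION_NAMES = {0: "south", 1: "north", 2: "east", 3: "west", 4: "pickup", 5: "dropoff"}

# Copied verbatim from the env.P[450] output above.
transitions_450 = {
    0: [(1.0, 450, -1, False)],
    1: [(1.0, 350, -1, False)],
    2: [(1.0, 450, -1, False)],
    3: [(1.0, 430, -1, False)],
    4: [(1.0, 450, -10, False)],
    5: [(1.0, 450, -10, False)],
}


def describe(state, transitions):
    """Return one readable line per (action, outcome) pair."""
    lines = []
    for action, outcomes in transitions.items():
        for probability, next_state, reward, done in outcomes:
            moved = "moves to" if next_state != state else "stays in"
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, {moved} state "
                f"{next_state}, reward {reward}, done={done}"
            )
    return lines


def actions_that_move(state, transitions):
    """List the actions with at least one outcome that changes the state."""
    return [
        action
        for action, outcomes in transitions.items()
        if any(next_state != state for _, next_state, _, _ in outcomes)
    ]


for line in describe(450, transitions_450):
    print(line)
print(actions_that_move(450, transitions_450))  # [1, 3]
```

From state 450 only north and west change the state; south and east run into a wall, and both pick-up and drop-off are illegal here and cost 10.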