Reinforcement Learning Model Building
- shows the steps for building an RL model, with explanations

Process for building a successful RL model
1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experiences and refining our strategy
6. Iterate until an optimal strategy is found
(Example taken from: https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/)

In [17]:
%pip install cmake "gym[atari]" scipy
%pip install pygame

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [18]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import gym
import pygame
env = gym.make("Taxi-v3", render_mode="ansi")
env.reset()  # Resets the environment and returns a random initial state
print(env.render())  # Renders one frame of the environment
# Filled square represents the taxi
# Pipe ("|") represents a wall
# R, G, Y, B = possible pickup and destination locations
# Blue letter = current passenger pickup location
# Purple letter = current destination

+---------+
|R: | : :G|
|[43m [0m: | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[34;1mB[0m: |
+---------+




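env.step(action)
- Steps the environment by one timestep. In Gym 0.26 and later (the five-value unpacking used in this notebook) it returns:

    - observation: Observation of the environment after the action
    - reward: Numeric reward that tells whether the action was beneficial or not
    - terminated: Indicates if we have successfully picked up and dropped off a passenger, which ends one episode
    - truncated: Indicates if the episode was cut off by the time limit (200 steps for Taxi-v3) before the task was finished
    - info: Additional info such as performance and latency for debugging purposes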
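As a minimal sketch of how these return values are typically consumed: recent Gym releases (0.26 and later) return five values from `step`, splitting the old `done` flag into `terminated` and `truncated`. The `ToyEnv` class below is a hypothetical stand-in, not part of Gym; it only mimics that return shape so the loop runs without any installation.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the Gym >= 0.26 reset/step shape."""

    def __init__(self, goal=3, max_steps=20):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        # Action 1 moves right, anything else moves left (clamped at 0)
        self.position = max(0, self.position + (1 if action == 1 else -1))
        self.steps += 1
        terminated = self.position == self.goal  # task finished
        truncated = (not terminated) and self.steps >= self.max_steps  # time limit hit
        reward = 10 if terminated else -1
        return self.position, reward, terminated, truncated, {}


def run_episode(env, rng):
    """Run one episode with random actions and return (steps, total_reward)."""
    observation, info = env.reset()
    total_reward, steps, done = 0, 0, False
    while not done:
        action = rng.choice([0, 1])
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # the episode ends in either case
        total_reward += reward
        steps += 1
    return steps, total_reward


steps, total_reward = run_episode(ToyEnv(), random.Random(0))
print(f"steps={steps}, total_reward={total_reward}")
```

Treating `terminated or truncated` as the loop condition keeps a random agent from running forever when it never finishes the task.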

In [19]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


Action Space (length = 6)
0. South
1. North
2. East
3. West
4. Pickup
5. Dropoff

State Space (500) -> corresponds to an encoding of the taxi's location (25 grid cells), the passenger location (4 pickup points, or inside the taxi), and the destination (4 locations): 25 x 5 x 4 = 500

RL will learn a mapping from states to the optimal action to perform in each state through exploration (the agent (taxi) explores the environment and takes actions based on the rewards defined in the environment)

In [20]:
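# (taxi row, taxi column, passenger index, destination index)
state = env.unwrapped.encode(3, 1, 2, 0)
print("State:", state)

# gym.make() wraps the environment, so assigning env.s only sets an
# attribute on the wrapper; set the state on the underlying environment.
env.unwrapped.s = state
print(env.render())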

State: 328
+---------+
|R: | : :[34;1mG[0m|
| : |[43m [0m: : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+




We use the illustration's coordinates to generate a number between 0 and 499 that corresponds to a state.
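The arithmetic behind that number can be sketched as a mixed-radix encoding. This is an illustration under the assumption of a 5 rows x 5 columns x 5 passenger locations x 4 destinations layout, not Gym's own code; the function names are made up here. It does reproduce the 328 printed above.

```python
def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack the four components into one integer in [0, 499]."""
    state = taxi_row                        # 5 rows
    state = state * 5 + taxi_col            # 5 columns
    state = state * 5 + passenger_index     # 4 pickup points + "in taxi"
    state = state * 4 + destination_index   # 4 destinations
    return state


def decode(state):
    """Invert encode(): recover (row, col, passenger, destination)."""
    state, destination_index = divmod(state, 4)
    state, passenger_index = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
```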

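The environment's state can be set manually through the underlying environment's s attribute (env.env.s, or env.unwrapped.s when gym.make adds several wrappers).

When the Taxi environment is created, it also builds an initial reward table called `P`. Think of it as a matrix with states as rows and actions as columns.

Format of each row of the matrix: {action: [(probability, nextstate, reward, done)]}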

In [21]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

actions are 0-5

probability is always 1.

nextstate is the state we would be in if we take the action at this index of the dict

all movement actions have -1 reward and pickup/dropoff actions have -10 reward in this particular state
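To make the format concrete, here is a small sketch that reads the table for state 328. The dictionary is copied from the output above; the helper name and the `ACTION_NAMES` list are introduced only for this illustration.

```python
# Transition table for state 328, copied from the output above
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def expected_reward(transitions):
    """Probability-weighted reward over all outcomes of one action."""
    return sum(prob * reward for prob, _next_state, reward, _done in transitions)


for action, transitions in P_328.items():
    _prob, next_state, _reward, _done = transitions[0]
    print(f"{ACTION_NAMES[action]:>8}: next state {next_state}, "
          f"expected reward {expected_reward(transitions)}")

# Movement actions that leave the state unchanged are blocked (here by a wall)
blocked = [ACTION_NAMES[a] for a, t in P_328.items() if a < 4 and t[0][1] == 328]
print("Blocked moves:", blocked)
```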

The next two cells show a baseline that does NOT use reinforcement learning: the agent simply picks random actions.

In [22]:
env.unwrapped.s = 328  # set the underlying environment to the illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

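while not done:
    action = env.action_space.sample()  # pick a random action
    # Gym >= 0.26 returns (observation, reward, terminated, truncated, info)
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated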

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))


Timesteps taken: 200
Penalties incurred: 68




In [23]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|R: |[43m [0m: :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+
  (Dropoff)

Timestep: 200
State: 47
Action: 5
Reward: -10


Enter Reinforcement Learning within the Context of this Problem

This section shows an RL algorithm called Q-Learning.

The random agent above runs into the 200-step limit and makes a lot of wrong drop-offs (68 penalties in the run above). This is because it isn't learning from past experience.

Intro to Q-Learning

- lets the agent use the environment's rewards to learn, over time, the best action to take in a given state

Within the example:

- we have a reward table `P` that the agent learns from: it looks at the reward received for taking an action in the current state, then updates a Q-value to remember whether that action was beneficial

- the values stored in the Q-table are called Q-values, and each one maps to a (state, action) combination

- a Q-value for a particular state-action combination represents the "quality" of taking that action from that state (better Q-values = better chances of getting greater rewards)

![Q-value update equation](images/qValuesMathematicalRep.png)

- alpha is the learning rate (values from 0 to 1): the extent to which our Q-values are updated in every iteration

- gamma is the discount factor (values from 0 to 1): determines how much importance we want to give to future rewards
    - high values (close to 1) make our agent weigh the long-term effective reward
    - low values make our agent consider only immediate reward

Explanation of the Q-value update:
- the Q-value for the agent's current state and action is assigned by taking a weight (1 - alpha) of the old Q-value and adding alpha times the learned value
- the learned value is a combination of the reward for taking the current action in the current state and the discounted maximum Q-value of the next state we will be in once we take the current action

- the Q-value of a state-action pair is therefore an estimate of the sum of the instant reward and the discounted future reward (stored in the Q-table)
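Written out, the update described above has the following form. The symbols s_t, a_t, and r_t are notation chosen here for the current state, action, and reward:

$$
Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)
$$

As a worked example with illustrative numbers, take alpha = 0.1 and gamma = 0.6, an old Q-value of 0, a movement reward of -1, and a next state whose Q-values are all still 0:

$$
Q(s_t, a_t) \leftarrow 0.9 \cdot 0 + 0.1 \left( -1 + 0.6 \cdot 0 \right) = -0.1
$$

So a single update moves the Q-value only a tenth of the way toward the learned value, which is what a learning rate of 0.1 means.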

Q-Table
- a matrix where we have a row for every state (500) and a column for every action (6)
- every value is initialized to 0, and the values are then updated during training

![picture of tables](images/basisQTableandTrainedQTable.png)

Q-Learning Process Summary:

- Initialize the Q-table with all zeros.
- Start exploring actions: For each state, select any one among all possible actions for the current state (S).
- Travel to the next state (S') as a result of that action (a).
- For all possible actions from the state (S'), select the one with the highest Q-value.
- Update the Q-table values using the equation.
- Set the next state as the current state.
- If the goal state is reached, end the episode and repeat the process.
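The steps above can be sketched on a toy problem that runs in pure Python. The five-cell corridor, its reward values, and the episode count are assumptions made for this illustration and have nothing to do with Taxi-v3; only the update rule and the alpha/gamma/epsilon values follow this notebook.

```python
import random

N_STATES, N_ACTIONS = 5, 2   # corridor cells 0..4; actions: 0 = left, 1 = right
GOAL = N_STATES - 1
ALPHA, GAMMA, EPSILON = 0.1, 0.6, 0.1


def step(state, action):
    """Toy dynamics: -1 per move, +10 for reaching the goal cell."""
    next_state = min(GOAL, state + 1) if action == 1 else max(0, state - 1)
    terminated = next_state == GOAL
    return next_state, (10 if terminated else -1), terminated


def train(episodes=500, max_steps=200, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # all zeros
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < EPSILON:
                action = rng.randrange(N_ACTIONS)  # explore
            else:
                # exploit: action with the highest Q-value in this state
                action = max(range(N_ACTIONS), key=lambda a: q_table[state][a])
            next_state, reward, terminated = step(state, action)
            next_max = max(q_table[next_state])
            q_table[state][action] = (1 - ALPHA) * q_table[state][action] + ALPHA * (
                reward + GAMMA * next_max
            )
            state = next_state  # the next state becomes the current state
            if terminated:
                break
    return q_table


q_table = train()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)  # 1 means "move right" in that cell
```

After training, the greedy action in every non-goal cell should be "right", since that is the shortest way to the reward.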

After enough random exploration of actions, the Q-values tend to converge, giving our agent an action-value function that it can exploit to pick the best action for a given state.

To prevent "overfitting", which in this case would mean the agent always taking the same route, there is another parameter called epsilon.

With probability epsilon we explore the action space further by picking a random action instead of the action with the best Q-value. Higher epsilon values result in episodes with more penalties during training, because more of the actions are random.
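A minimal sketch of this epsilon-greedy choice, using only the standard library. The function name and the row of Q-values are made up for illustration.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random action (explore)
    return max(range(len(q_values)), key=lambda a: q_values[a])  # best Q-value (exploit)


q_values = [-2.4, -2.2, -2.5, -2.3, -11.0, -10.8]  # made-up row of a Q-table
rng = random.Random(0)

greedy = choose_action(q_values, epsilon=0.0, rng=rng)  # never explores
picks = [choose_action(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
explored = sum(1 for a in picks if a != greedy)
print(f"greedy action: {greedy}, non-greedy picks out of 1000: {explored}")
```

With epsilon = 0.1, roughly one pick in ten is random, and some of those random picks land on the greedy action anyway.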

Let's see this in terms of our example.

In [24]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])
#makes the Q-Table

In [29]:
%%time
"""Training the agent"""
#run on rosie

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()[0]

    epochs, penalties, reward, = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

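        # Gym >= 0.26 returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated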
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")


Episode: 9900


KeyboardInterrupt: 

In [None]:
q_table[328]

The action with the max Q-value is "north" (-2.169)

In [None]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

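for _ in range(episodes):
    state = env.reset()[0]  # reset() returns (observation, info)
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])  # always exploit the learned values
        state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated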

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

![](images/qLearingcomparedtoNone.png)

Hyperparameters & Optimizations:

- alpha (learning rate): should decrease as training continues
- gamma: as you get closer to the deadline, your preference for near-term reward should increase, as you won't be around long enough to get the long-term reward (so gamma should decrease overall)
- epsilon: as trials increase, epsilon should decrease (less exploration is needed as your strategy becomes more apparent)
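One possible way to decrease these values over time is exponential decay with a floor. This is a sketch of one common choice, not something prescribed by the tutorial; the function name, decay rates, and minimum values are illustrative assumptions.

```python
def decayed(initial, decay_rate, episode, minimum):
    """Exponential decay with a floor: initial * decay_rate**episode, never below minimum."""
    return max(minimum, initial * decay_rate ** episode)


# Illustrative schedule: explore a lot early on, then settle down
for episode in (0, 1000, 5000, 20000):
    epsilon = decayed(1.0, 0.9995, episode, 0.01)
    alpha = decayed(0.5, 0.9999, episode, 0.05)
    print(f"episode {episode:>5}: epsilon={epsilon:.3f}, alpha={alpha:.3f}")
```

The floor keeps a small amount of exploration and learning alive even late in training.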

Tuning the Hyperparameters

- A comprehensive search function that selects the parameters that result in the best reward/time_steps ratio
- you might also want to track the number of penalties for each hyperparameter combination, because this can also be a deciding factor
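Such a search could be sketched as a plain grid search. Everything here is a hypothetical illustration: `grid_search` and `fake_evaluate` are made-up names, and `fake_evaluate` returns invented numbers where a real version would train and evaluate an agent for each combination.

```python
import itertools


def grid_search(evaluate, alphas, gammas, epsilons):
    """Try every combination and return (best_params, best_score, all_results)."""
    results = []
    for alpha, gamma, epsilon in itertools.product(alphas, gammas, epsilons):
        reward, timesteps, penalties = evaluate(alpha, gamma, epsilon)
        score = reward / timesteps  # reward per timestep
        results.append(((alpha, gamma, epsilon), score, penalties))
    # Highest ratio first; fewer penalties breaks ties
    results.sort(key=lambda r: (-r[1], r[2]))
    best_params, best_score, _ = results[0]
    return best_params, best_score, results


def fake_evaluate(alpha, gamma, epsilon):
    """Placeholder only: returns made-up numbers instead of training an agent."""
    timesteps = 10 + round(100 * abs(alpha - 0.1) + 100 * abs(gamma - 0.6))
    penalties = round(10 * epsilon)
    reward = 20 - timesteps - 10 * penalties
    return reward, timesteps, penalties


best_params, best_score, results = grid_search(
    fake_evaluate, alphas=[0.1, 0.5], gammas=[0.6, 0.9], epsilons=[0.0, 0.1]
)
print("best (alpha, gamma, epsilon):", best_params, "score:", best_score)
```

Keeping the penalty count next to each score makes it possible to prefer a slightly lower ratio when it comes with far fewer penalties.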