# Space Discritization and Reward Augmentation

Simple Q-Learning works great in simple, discrete, environments. When it is possible to enumerate all the possible state-action pairs, and when that enumeration fits into memory. The FrozenLake example from last lab and tic-tac-toe both have relatively small state spaces that we can enumerate in memory. But Vanilla Q-learning can struggle, especially under theee circumstances:

* Large state spaces, e.g. Chess, Checkers, and Go.
* Continuous state spaces, e.g. any physics based game. 
* Sparse rewards, e.g. Atari game: Montezuma's Revenge.

In this lab we'll look at a game that has a continuous state space, the Lunar Lander game, and discuss tactic that have been used to address these weaknesses in Q-Learning, specifically:

* State augmentation — reducing the provided state information to a smaller space.
* State discritization — the same idea, but specifically for making continuous states discrete.
* Reward augmentation — rewarding the agent before it "wins" or "losses" based on the state. 

In [1]:
import numpy as np

import io
import base64
from IPython import display

import gym
from gym import wrappers

In [2]:
# This code is to embed the output into this notebook. 
# You may prefer to use the terminal directly, which will
# open a new window when you run a gym environment instead of 
# capturing a video. 
def imbed_round_video(video_env):
    video = io.open('./gym-videos/openaigym.video.%s.video000000.mp4' % video_env.file_infix, 'r+b').read()
    encoded = base64.b64encode(video)
    return display.HTML(data='''
        <video width="360" height="auto" alt="test" controls><source src="data:video/mp4;base64,{0}" type="video/mp4" /></video>'''
    .format(encoded.decode('ascii')))

In [57]:
# First, we can just make an environment from Gym
# And have the agent make a random action every time
original_env = gym.make('LunarLander-v2')

# The wrapper allows us to take a video so we can display it
# in the Jupyter notebook. 
env = wrappers.Monitor(original_env, "gym-videos/", force=True)
env.reset()

for _ in range(1000):
    # Randomly take an action
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)

    if done: break
        
# You're always supposed to close an environment when you're
# done with it in Gym. 
env.close()
original_env.close()

imbed_round_video(env)

In [58]:
# Okay, so that's what taking random actions looks like.
# Lets take a closer look at the information Gym gives us
original_env = gym.make('LunarLander-v2')
original_env.reset()

print("Actions: " , original_env.action_space)

observation, reward, done, info = env.step(0) # "Do Nothing" action
print("Observation: ", observation)
print("Reward: ", reward)
print("done: ", done)
print("Info: ", info)

original_env.close()

Actions:  Discrete(4)
Observation:  [ 0.8760195   0.1425406   1.3245243   0.15985654 -1.8934994  -0.62504584
  0.          0.        ]
Reward:  -100
done:  True
Info:  {}


Okay, that's not completely enlightening. Here's what the documentation has to say about this environment:
    
"Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine."

So, actions:

```
0: Do nothing  
1: Fire left engine  
2: Fire main engine  
3: Fire right engine  
```

And we can only take one of these actions per frame. Lets gut check:

In [34]:
# First, we can just make an environment from Gym
# And have the agent make a random action every time
original_env = gym.make('LunarLander-v2')

# The wrapper allows us to take a video so we can display it
# in the Jupyter notebook. 
env = wrappers.Monitor(original_env, "gym-videos/", force=True)

env.reset()
for _ in range(1000):
    # We should just fall straight down, never use the engine
    # Or we can change this to take the other actions...
    action = 0
    observation, reward, done, info = env.step(action)

    if done: break
        
# You're always supposed to close an environment when you're
# done with it in Gym. 
env.close()
original_env.close()

imbed_round_video(env)

In [14]:
# Okay great, looks like we have a good idea about the action space.
# But what about the "observation"? Lets get the first three observations:
original_env = gym.make('LunarLander-v2')
original_env.reset()

print("Actions: " , original_env.action_space)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

observation, reward, done, info = env.step(0)
print("Observation: \n", observation)

original_env.close()

Actions:  Discrete(4)
Observation: 
 [ 0.71774447 -0.24852164  0.34225827  0.01927561 -0.8898331  -0.5850002
  1.          0.        ]
Observation: 
 [ 0.72096956 -0.24758065  0.34223717 -0.00747953 -0.9329785  -0.5847961
  1.          0.        ]
Observation: 
 [ 0.7242203  -0.24713261  0.3418834  -0.03405672 -0.978378   -0.58480734
  1.          0.        ]


Unfortunately, a lot of the Gym environments are not well documented. I had to dig through the [source code](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py) for this line to figure out what the observation space was:

```python
 state = [
            (pos.x - VIEWPORT_W/SCALE/2) / (VIEWPORT_W/SCALE/2),
            (pos.y - (self.helipad_y+LEG_DOWN/SCALE)) / (VIEWPORT_H/SCALE/2),
            vel.x*(VIEWPORT_W/SCALE/2)/FPS,
            vel.y*(VIEWPORT_H/SCALE/2)/FPS,
            self.lander.angle,
            20.0*self.lander.angularVelocity/FPS,
            1.0 if self.legs[0].ground_contact else 0.0,
            1.0 if self.legs[1].ground_contact else 0.0
]
```

So, the first two values are the position of the lander (x, y). The next two values are the x,y velocity. After that the current angle of the lander, then the angular velocity. Finally, the last two values indicate whether or not the landers left and right legs are touching the ground. 

#### So, the state variable can be interpreted as:

```
[
   x_position, y_position, 
   x_velocity, y_velocity, 
   current_angle, angular_velocity,
   left_leg_grounded, right_leg_grounded
]
```

But, unfortunately for us, all but the last two these values are continuous values between -1 and 1, of which there are an infinite amount. This makes the state space essentially infinite, plus, we might never even encounter the exact same state twice, even a tiny variation in one of these variables would move us to a "new" never before seen state, so our agent will seriously struggle to learn.

### Solution: Discritize The State Variables.

We can get force the state-space into something a Q-Agent can digest by taking our continous variables and making them discrete. In this case, lets start very simple and say we want to reduce each of the continuous variables bounded from -1 to 1 into 9 buckets. We set boundaries for the values, and truncate the incoming numbers to one of the 9 "between" values. 

`[-1, -.8, -.6, -.4, -.2, 0, .2, .4, .6, .8, 1]`

.95 becomes "between .8 and 1"  
.33 becomes "between .2 and .4"  
and so on  

Doing this will still give us a large state space: `9^6 * 2 * 2`. Six variables that each have 9 possibilities, and 2 variables that can only take 2 values (the grounded leg variables). Thats `2,125,764` possible values which is a lot. But many of them will probably never be explored, and from a computer's perspective, two million is not impossible.

In [27]:
# Numpy has a built in function to discritize things
# which requires us to specify the bins as a numpy array:

# This whole game can basically be thought of as not doing anything too fast.
# so we'll discritize the range into 6 positions.
 
discritization_bins = np.array([-1.01, -.05, 0, .05, 1.01])

# This function will take in a lunar lander state as an array, and return a
# tuple representing that state. A tuple is chosen because they are hashable
# in python which lets us save some space and represent the Q-Table as a 
# dictionary instead of a 6 dimentional array. 
def discritize_lander_state(lander_state):
    discritized_vars = np.digitize(lander_state[0:6], discritization_bins)
    
    # For readability only, this likely degrades performance
    xpos, ypos, xvel, yvel, angle, angle_vel = discritized_vars
    
    # Return a tuple with the discritized vars, and the grounded/not grounded vars
    return (xpos, ypos, xvel, yvel, angle, angle_vel, lander_state[6], lander_state[7])

In [28]:
# Simple test to demonstrate the function:

state = [-1, -.9, -.001, 0, .001, 0.3850002, 1, 0]

print(discritize_lander_state(state))

(1, 1, 2, 3, 3, 4, 1, 0)


In [215]:
# So, lets try discritizing the lunar lander game and using Q-Learning!
environment = gym.make('LunarLander-v2')

# This time we're using a dict instead of a 6D array.
# This is to save space since many states will never be reached, in 
# my opinion this also makes it easier to think about the Q-Table 
# since at its heart it is a key->value mapping.

# Each state will map to an array, that array will have 4 values
# one for each possible action.
q_table = {}

# initialize the first state
state = environment.reset()
discrete_state = discritize_lander_state(state)
q_table[discrete_state] = [0.0, 0.0, 0.0, 0.0]

# Some global parameters for Q-Learning
learning_rate = 0.1 
discount_factor = 0.95
exploration_rate = 0.3

# Lets just try 5000 attempts to start. 
training_episodes = 10000

# lets also track the average reward every so often
avg_reward = 0

for current_episode_num in range(training_episodes):
    state = environment.reset()
    
    # Note, using our discritizer. 
    discrete_state = discritize_lander_state(state)
    if q_table.get(discrete_state) is None:
        q_table[discrete_state] = [0.0, 0.0, 0.0, 0.0]

    done = False
    while not done:        
        # Explore or not
        explore = np.random.random() < exploration_rate
        if explore:
            action = environment.action_space.sample()
        else:
            # If we're not exploring randomly, we need to examine the Q-table 
            # to determine the best possible action given the current state
            action = np.argmax(q_table[discrete_state])

        # Take the action, note we are discritizing again
        next_state, reward, done, _ = environment.step(action)
        discrete_next_state = discritize_lander_state(next_state)
        
        # If we've never seen this state before, make it.
        if q_table.get(discrete_next_state) is None:
            q_table[discrete_next_state] = [0.0, 0.0, 0.0, 0.0]
        
        prev_q_value = q_table[discrete_state][action]
        discounted_future_reward = discount_factor * np.max(q_table[discrete_next_state])

        q_table[discrete_state][action] = (
            prev_q_value + (learning_rate * (reward + discounted_future_reward - prev_q_value))

        )
        
        # Update the state for the next round. Note: discrete
        discrete_state = discrete_next_state
        
    # Every time we finish an episode, log the final reward:
    avg_reward += reward
    if current_episode_num % 500 == 0:
        print("Finished episode: ", current_episode_num)
        print("  Avg. Reward=", avg_reward / 500, "\n")
        avg_reward = 0
    
print("finished!")

Finished episode:  0
  Avg. Reward= -0.2 

Finished episode:  500
  Avg. Reward= -99.03532501683864 

Finished episode:  1000
  Avg. Reward= -96.36911082458043 

Finished episode:  1500
  Avg. Reward= -95.78702540286177 

Finished episode:  2000
  Avg. Reward= -96.8539772319568 

Finished episode:  2500
  Avg. Reward= -92.35281452921012 

Finished episode:  3000
  Avg. Reward= -92.6173543839266 

Finished episode:  3500
  Avg. Reward= -93.92919083447508 

Finished episode:  4000
  Avg. Reward= -95.85698773892632 

Finished episode:  4500
  Avg. Reward= -93.78365113884493 

Finished episode:  5000
  Avg. Reward= -93.82503338253244 

Finished episode:  5500
  Avg. Reward= -92.85045126806841 

Finished episode:  6000
  Avg. Reward= -95.83387749883016 

Finished episode:  6500
  Avg. Reward= -91.76531664132123 

Finished episode:  7000
  Avg. Reward= -93.00954157009691 

Finished episode:  7500
  Avg. Reward= -92.78111386391741 

Finished episode:  8000
  Avg. Reward= -91.37266756278564 



In [216]:
# -100 is the score for crashing, scores in the -90's more or less means we're crashing
# 90% of the time. Our agent mostly gets better... but doesn't end up being super good. 

# Also lets just notice the total number of states we visited:
print(len(q_table))

12849


In [219]:
# Lets see what our lander is doing, we know it hardly wins so lets 
# not get our hopes TOO high...

# Embed 20 attempts:
for _ in range(20):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Lets visualize a single playthrough.
    state = environment.reset()
    discrete_state = discritize_lander_state(state)

    done = False
    while not done:
        action = np.argmax(q_table.get(discrete_state))
        state, reward, done, _ = environment.step(action)
        discrete_state = discritize_lander_state(state)

        # If the game finished before our max number of rounds, break out
        if done: break

    print("Final Reward: ", reward)
    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Final Reward:  -100


Final Reward:  100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  100


Final Reward:  -100


Final Reward:  -100


In [240]:
# Not great, but not awful either... It rarely wins but it plays
# in such a way that it doesn't just catastrophically lose. 

# One problem is that our agent isn't getting any feedback until the end.
# This is called the "sparse reward" problem. Our lander doesn't know if it's
# getting "closer" because crashing is always a -100 reward no matter where it 
# crashes. If it always crashes, it can't learn NOT to crash, but our random
# exploration very rarely prevents it from crashing. A vicious cycle.

# We'll try to address this with "reward augmentation", by providing additional rewards
# in the middle phases of training if the agent is getting close to a reward state.

# We know the raw agent is doing okay, so we just want to provide a couple 
# "nudges" to help the agent find the true winning states.
def compute_additional_reward(discrete_state):
    bonus_reward = 0
    
    xpos, ypos, xvel, yvel, angle, angle_vel, left_grounded, right_grounded = discrete_state
    
    # Reward being mostly flat, in the right xposition, w/ low angular vel.
    # (If we're off to either side, the agent needs to tilt to correct)
    if angle >= 2 and angle <= 3 and xpos >=2 and xpos <= 3 and angle_vel >= 2 and angle_vel <= 3:
        bonus_reward += 2
        
        # If we're flat, and in the zone, reward moving slowly.
        if yvel >= 2 and yvel <= 3 and xvel >=2 and xvel <= 3:
            bonus_reward += 3 

            # If we're mostly flat, moving slow, and in the zone
            # Give an additional reward!
            if ypos >= 2 and ypos <= 3:
                bonus_reward += 5
    
    # If we're going too fast ever, give a punishment
    if yvel <= 0 or yvel >= 5:
        bonus_reward -= 5
        
    # Punish for going pretty far left or right
    if xpos <= 0 or xpos >= 5:
        bonus_reward -= 5
        
    return bonus_reward

In [241]:
# This is similar to the same code as above, but using our augmented rewards.
# we also add some exploration rate decay, starting with more exploration and
# lowering it as rounds continue.

# So, lets try discritizing the lunar lander game and using Q-Learning!
environment = gym.make('LunarLander-v2')

# This time we're using a dict instead of a 6D array.
# This is to save space since many states will never be reached, in 
# my opinion this also makes it easier to think about the Q-Table 
# since at its heart it is a key->value mapping.

# Each state will map to an array, that array will have 4 values
# one for each possible action.
q_table = {}

# Some global parameters for Q-Learning
learning_rate = 0.1 
discount_factor = 0.95
exploration_rate = 0.3
training_episodes = 10000

# lets also track the average reward every so often
avg_reward = 0

for current_episode_num in range(training_episodes):
    state = environment.reset()
    
    # Note, using our discritizer. 
    discrete_state = discritize_lander_state(state)
    if q_table.get(discrete_state) is None:
        q_table[discrete_state] = [0.0, 0.0, 0.0, 0.0]
    
    done = False
    while not done:        
        # Explore or not
        explore = np.random.random() < exploration_rate
        if explore:
            action = environment.action_space.sample()
        else:
            # If we're not exploring randomly, we need to examine the Q-table 
            # to determine the best possible action given the current state
            action = np.argmax(q_table[discrete_state])

        # Take the action, note we are discritizing again
        next_state, reward, done, _ = environment.step(action)
        discrete_next_state = discritize_lander_state(next_state)
        
        # If we've never seen this state before, make it.
        if q_table.get(discrete_next_state) is None:
            q_table[discrete_next_state] = [0.0, 0.0, 0.0, 0.0]
        
        prev_q_value = q_table[discrete_state][action]
        discounted_future_reward = discount_factor * (np.max(q_table[discrete_next_state]))

        # This is where we apply our bonus reward!
        reward_augmentation = compute_additional_reward(discrete_state)
        q_table[discrete_state][action] = (
            prev_q_value + (learning_rate * (reward + reward_augmentation + discounted_future_reward - prev_q_value))

        )
        
        # Update the state for the next round. Note: discrete
        discrete_state = discrete_next_state
        
    # Every time we finish an episode, log the final reward.
    # Note that we're using the "raw" reward, not the augmented one for
    # reporting. 
    avg_reward += reward
    if current_episode_num % 500 == 0:
        print("Finished episode: ", current_episode_num, "exploration_rate: ", exploration_rate)
        print("  Avg. Reward=", avg_reward / 500, "\n")
        avg_reward = 0
    
print("finished!")

Finished episode:  0 exploration_rate:  0.3
  Avg. Reward= -0.2 

Finished episode:  500 exploration_rate:  0.3
  Avg. Reward= -98.40819472872506 

Finished episode:  1000 exploration_rate:  0.3
  Avg. Reward= -96.61931666790046 

Finished episode:  1500 exploration_rate:  0.3
  Avg. Reward= -94.60965575386422 

Finished episode:  2000 exploration_rate:  0.3
  Avg. Reward= -91.63004312404681 

Finished episode:  2500 exploration_rate:  0.3
  Avg. Reward= -90.48983560733754 

Finished episode:  3000 exploration_rate:  0.3
  Avg. Reward= -90.49157406188544 

Finished episode:  3500 exploration_rate:  0.3
  Avg. Reward= -90.68938350812036 

Finished episode:  4000 exploration_rate:  0.3
  Avg. Reward= -88.97590881239947 

Finished episode:  4500 exploration_rate:  0.3
  Avg. Reward= -93.97380796167852 

Finished episode:  5000 exploration_rate:  0.3
  Avg. Reward= -90.22818303676262 

Finished episode:  5500 exploration_rate:  0.3
  Avg. Reward= -89.39776760812282 

Finished episode:  600

In [242]:
# Great (sort of)! We were mostly consistent at improving our win rate
# Even with a 30% exploration rate. And our best rounds were better 
# than our previous best rounds in terms of avg. final score. 

# Sure... we still seem to be loosing about 90% of the time,
# But this is still pretty admirable, this is a genuine challenge
# for Q-Learning. We'll do better with Deep Q Learning in the next lab. 

# Embed 20 attempts:
for _ in range(20):
    orig_environment = gym.make('LunarLander-v2')
    environment = wrappers.Monitor(orig_environment, "gym-videos/", force=True)

    # Lets visualize a single playthrough.
    state = environment.reset()
    discrete_state = discritize_lander_state(state)

    done = False
    while not done:
        action = np.argmax(q_table.get(discrete_state))
        state, reward, done, _ = environment.step(action)
        discrete_state = discritize_lander_state(state)

        # If the game finished before our max number of rounds, break out
        if done: break

    print("Final Reward: ", reward)
    environment.close()
    orig_environment.close()

    display.display(imbed_round_video(environment))

Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -0.3470618527027678


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


Final Reward:  -100


In [None]:
# We can be honest... these results are still not as good as we'd like. 
# In no small part, that's because this game is a real challenge for Q-learning
# But it is possible to do better, here are some ideas to try:

# Think more carefully about discritizing the state space:
  # Does it make sense to use the same discritization range for all 6 of the continous variables? (no...)
  # We might want a finer grain on position, but our grain on velocity might be fine. 
  # We might want to further reduce the state space into something smaller, for example we could:
    # Just keep track of 3 possible angles: left-tilt, right-tilt, and neutral
    # Just keep track of 3 possible x positions: "too left" "too right" and "just right"
    # Track a finer grain of y positions from too tall to about right. 
    # ... so on ...
    
# Think more carefully about the reward function, esp. as you change the state space:
  # Augmented rewards can lead to unintended consequences!
  # Try to define rules that create a dense reward space (more feedback for the agent), 
  # but only reward internal states that really will get the agent closer to victory!