# Mountain Car Using Explicit Policy

This notebook implements several solutions to the OpenAI Gym "Mountain Car" problem using explicit, hard-coded policies that do not use reinforcement learning.

## Overall Setup
This section contains the common setup code for all the policy implementations, as well as common utilities.

In [1]:
import gym
import math

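# Constants
# Note: MountainCar-v0 defines action 0 as "push left", 1 as "no push", and
# 2 as "push right"; the goal flag is on the right. The names below are this
# notebook's own labels for the two push actions.
REVERSE_ACTION: int = 2
FORWARD_ACTION: int = 0
OBS_POS_IDX: int = 0  # observation[0] is the car position
OBS_VEL_IDX: int = 1  # observation[1] is the car velocity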

# Global Variables
env = gym.make('MountainCar-v0')

# Function: Toggle between forward and reverse actions
def toggle_action(current_action):
    if current_action == FORWARD_ACTION:
        return REVERSE_ACTION
    else:
        return FORWARD_ACTION

def string_from_action(the_action):
    if the_action == FORWARD_ACTION:
        return 'FORWARD'
    else:
        return 'BACKWARD'

def string_from_bool(the_bool, true_string='True', false_string='False'):
    if the_bool:
        return true_string
    else:
        return false_string


WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


## Policy One: Rapid Switch Back and Forth
The policy implemented in this section starts with the reverse action and keeps applying the current action until the car's speed (the absolute value of its velocity) drops to 10<sup>-3</sup> or below. At that point the policy toggles to the opposite action, and the process repeats until the episode ends. The expectation is that the repeated back-and-forth motion will let the car build up enough momentum to reach the goal.
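The switching rule described above can be isolated from the environment loop. The following is a minimal sketch, not the notebook's own cell: the helper name `speed_switch` and the velocity trace are hypothetical, and the action codes are copied from the setup section.

```python
import math

# Action codes as named in this notebook's setup cell.
FORWARD_ACTION = 0
REVERSE_ACTION = 2


def speed_switch(action, velocity, threshold=1e-3):
    """Return the next action: toggle when speed is at or below threshold."""
    if math.fabs(velocity) <= threshold:
        return REVERSE_ACTION if action == FORWARD_ACTION else FORWARD_ACTION
    return action


# Hypothetical velocity trace for illustration, not taken from a real run.
action = REVERSE_ACTION
for velocity in [0.0, -0.002, -0.01, -0.0005, 0.004]:
    action = speed_switch(action, velocity)
    print(velocity, action)
```

Because the rule depends only on the current speed and the current action, it can be checked without creating a Gym environment.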

In [2]:
# Function: Rapid switch back and forth based on speed
def rapid_speed_based_switch():
    with open("./01_mountain_car_rapid_speed_switch.csv", "w") as trace_file:
        print("Episode,Speed,Action", file=trace_file)
        total_reward:float = 0.0
        for episode_idx in range(20):
            observation = env.reset()
            action:int = REVERSE_ACTION
            episode_reward:float = 0.0
            done:bool = False
            speed:float = 0.0
            while not done:
                env.render()
                speed = math.fabs(observation[OBS_VEL_IDX])
                print("{:d},{:f},{:s}".format(episode_idx + 1, speed, string_from_action(action)), file=trace_file)
                if speed <= 1e-3:
                    action = toggle_action(action)
                observation, step_reward, done, info = env.step(action)
                episode_reward += step_reward
            print('Episode {:d} Reward: {:f}'.format(episode_idx + 1, episode_reward))
            total_reward += episode_reward

    print('Policy One Average reward {:f}'.format(total_reward / (episode_idx + 1)))
    env.close()

rapid_speed_based_switch()

Episode 1 Reward: -117.000000
Episode 2 Reward: -88.000000
Episode 3 Reward: -116.000000
Episode 4 Reward: -120.000000
Episode 5 Reward: -116.000000
Episode 6 Reward: -85.000000
Episode 7 Reward: -109.000000
Episode 8 Reward: -89.000000
Episode 9 Reward: -103.000000
Episode 10 Reward: -116.000000
Episode 11 Reward: -115.000000
Episode 12 Reward: -160.000000
Episode 13 Reward: -114.000000
Episode 14 Reward: -177.000000
Episode 15 Reward: -122.000000
Episode 16 Reward: -88.000000
Episode 17 Reward: -87.000000
Episode 18 Reward: -87.000000
Episode 19 Reward: -116.000000
Episode 20 Reward: -122.000000
Policy One Average reward -112.350000


## Policy Two: Acceleration-Based Switching
The policy implemented in this section is meant as an enhancement over the previous one: it switches on acceleration, computed as the change in speed, rather than on speed itself. It also uses the sign of the velocity in the initial observation to choose the starting action, instead of always starting in reverse. Once acceleration falls below 10<sup>-3</sup>, the policy changes direction and disables further switching; switching is re-enabled only after acceleration rises above 10<sup>-3</sup> again.
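The switch-and-rearm logic described above can be sketched as a small stateful object. This is an illustrative sketch under assumptions, not the notebook's implementation: the class name `AccelSwitch` and the velocity values are hypothetical. Unlike the cell that follows, it stores the previous speed on every step so that `accel` is the per-step change in speed.

```python
import math

# Action codes as named in this notebook's setup cell.
FORWARD_ACTION = 0
REVERSE_ACTION = 2


class AccelSwitch:
    """Toggle the action when the change in speed falls below a threshold."""

    def __init__(self, initial_velocity, threshold=1e-3):
        self.action = REVERSE_ACTION if initial_velocity < 0 else FORWARD_ACTION
        self.threshold = threshold
        self.last_speed = 0.0
        self.ok_to_switch = True

    def step(self, velocity):
        speed = math.fabs(velocity)
        accel = speed - self.last_speed
        self.last_speed = speed  # remember this step's speed for the next call
        if self.ok_to_switch and accel < self.threshold:
            # Speed has stopped growing: reverse the push and disarm.
            if self.action == FORWARD_ACTION:
                self.action = REVERSE_ACTION
            else:
                self.action = FORWARD_ACTION
            self.ok_to_switch = False
        elif not self.ok_to_switch and accel > self.threshold:
            # Speed is growing again: allow the next switch.
            self.ok_to_switch = True
        return self.action


# Hypothetical velocity trace for illustration, not taken from a real run.
policy = AccelSwitch(initial_velocity=0.0)
actions = [policy.step(v) for v in [0.0, 0.002, 0.004, 0.0045]]
print(actions)
```

The `ok_to_switch` flag keeps the policy from toggling on consecutive steps while the car is still slow, which is the main behavioral difference from Policy One.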

In [5]:
# Function: Acceleration based switching
def accel_based_switch():
    with open("./01_mountain_car_accel_switch.csv", "w") as trace_file:
        print('Episode,Velocity,Accel,SwOK,Action', file=trace_file)
        total_reward:float = 0.0
        for episode_idx in range(20):
            observation = env.reset()
            action:int = REVERSE_ACTION if observation[OBS_VEL_IDX] < 0 else FORWARD_ACTION
            episode_reward:float = 0.0
            done:bool = False
            accel:float = 0.0
            last_vel:float = 0.0
            ok_to_switch:bool = True
            while not done:
                print(
                    '{:d},{:f},{:f},{:s},{:s}'
                    .format(
                        episode_idx + 1,
                        observation[OBS_VEL_IDX],
                        accel,
                        string_from_bool(
                            ok_to_switch,
                            true_string='YES',
                            false_string='NO'
                        ),
                        string_from_action(action)
                    ),
                    file=trace_file
                )
                env.render()
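                # NOTE: last_vel is never reassigned in this loop, so accel
                # evaluates to the current speed. To measure the per-step
                # change in speed, set last_vel = observation[OBS_VEL_IDX]
                # after this line.
                accel = math.fabs(observation[OBS_VEL_IDX]) - math.fabs(last_vel)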
                if ok_to_switch and accel < 1e-3:
                    action = toggle_action(action)
                    ok_to_switch = False
                elif not ok_to_switch and accel > 1e-3:
                    ok_to_switch = True
                observation, step_reward, done, info = env.step(action)
                episode_reward += step_reward
            print('Episode {:d} Reward: {:f}'.format(episode_idx + 1, episode_reward))
            total_reward += episode_reward
    print('Policy Two Average reward {:f}'.format(total_reward / (episode_idx + 1)))
    env.close()

accel_based_switch()

Episode 1 Reward: -118.000000
Episode 2 Reward: -119.000000
Episode 3 Reward: -122.000000
Episode 4 Reward: -118.000000
Episode 5 Reward: -113.000000
Episode 6 Reward: -119.000000
Episode 7 Reward: -115.000000
Episode 8 Reward: -113.000000
Episode 9 Reward: -114.000000
Episode 10 Reward: -122.000000
Episode 11 Reward: -119.000000
Episode 12 Reward: -118.000000
Episode 13 Reward: -113.000000
Episode 14 Reward: -117.000000
Episode 15 Reward: -121.000000
Episode 16 Reward: -118.000000
Episode 17 Reward: -113.000000
Episode 18 Reward: -113.000000
Episode 19 Reward: -117.000000
Episode 20 Reward: -118.000000
Policy Two Average reward -117.000000
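
The two sets of recorded rewards can be compared side by side. A minimal sketch, using only the episode rewards printed in this notebook; the list and function names are illustrative.

```python
# Episode rewards copied from the outputs recorded in this notebook.
policy_one = [-117, -88, -116, -120, -116, -85, -109, -89, -103, -116,
              -115, -160, -114, -177, -122, -88, -87, -87, -116, -122]
policy_two = [-118, -119, -122, -118, -113, -119, -115, -113, -114, -122,
              -119, -118, -113, -117, -121, -118, -113, -113, -117, -118]


def summarize(rewards):
    """Return the mean, best, and worst episode reward."""
    return {
        "mean": sum(rewards) / len(rewards),
        "best": max(rewards),
        "worst": min(rewards),
    }


summary_one = summarize(policy_one)
summary_two = summarize(policy_two)
print("Policy One:", summary_one)
print("Policy Two:", summary_two)
```

Over these 20 episodes each, Policy One has the higher mean reward (-112.35 versus -117.00) but a much wider spread (-177 to -85, versus -122 to -113 for Policy Two).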
