# Cart Pole Using Explicit Policy
This notebook contains code that implements solutions to the OpenAI Gym "Cart Pole" problem using explicit, hand-written policies rather than reinforcement learning techniques.

## Overall Initialization
This section contains the constants, variables, and functions used by all of the problem solutions.

In [1]:
import gym
import math

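# Constants
LEFT_ACTION:int = 0            # Push the cart to the left
RIGHT_ACTION:int = 1           # Push the cart to the right
OBS_CART_POS_IDX:int = 0       # Observation index: cart position
OBS_CART_VEL_IDX:int = 1       # Observation index: cart velocity
OBS_POLE_POS_IDX:int = 2       # Observation index: pole angle (radians, 0 = upright)
OBS_POLE_ANGVEL_IDX:int = 3    # Observation index: pole angular velocity
MAX_STEP_COUNT:int = 200       # Per-episode step cap applied by every policy loop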

# Global Variables
env = gym.make('CartPole-v1')

# Function: Translate action to string
def string_from_action(the_action):
    if the_action == LEFT_ACTION:
        return "LEFT"
    else:
        return "RIGHT"
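
The policy cells that follow each repeat the same episode loop. As a minimal sketch, assuming the classic Gym API this notebook uses (`reset()` returning the observation and `step()` returning a 4-tuple), that loop could be factored into a helper that receives the policy as a function. The names `run_episode`, `Policy`, and `StubEnv` are hypothetical and not part of the notebook; `StubEnv` only stands in for a real environment so the loop can be exercised without Gym.

```python
from typing import Callable, Sequence

LEFT_ACTION = 0
RIGHT_ACTION = 1

# A policy maps (observation, previous action) to the next action.
Policy = Callable[[Sequence[float], int], int]


def run_episode(env, policy: Policy, max_steps: int = 200) -> float:
    """Run one episode and return the accumulated reward.

    Assumes reset() returns the observation and step() returns
    (observation, reward, done, info), as in the notebook's cells.
    """
    observation = env.reset()
    action = LEFT_ACTION
    episode_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation, action)
        observation, step_reward, done, _info = env.step(action)
        episode_reward += step_reward
        if done:
            break
    return episode_reward


class StubEnv:
    """Hypothetical stand-in environment that ends after a fixed number of steps."""

    def __init__(self, episode_length: int):
        self.episode_length = episode_length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.episode_length
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}
```

With such a helper, a policy could be evaluated with a call like `run_episode(env, lambda obs, prev: RIGHT_ACTION)`. Passing the previous action to the policy is a design choice that leaves room for rules that sometimes keep the current action, such as the dead zone in Policy Three.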

## Policy One: Counter Angular Velocity
The policy implemented in this section observes the angular velocity of the pole and selects the action that counters (i.e., reduces) it. If the pole is rotating counter-clockwise (a negative angular velocity in the observation), the action slides the cart left. Conversely, if the pole is rotating clockwise (a positive angular velocity), the action slides the cart right. The expectation is that as the pole begins to sway in a specific direction, the cart will move to compensate.
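
The decision rule described above can be isolated as a pure function of the observation. This is a minimal sketch: the function name `counter_angvel_action` is hypothetical, and the sample observations are made-up values chosen only to show both branches.

```python
LEFT_ACTION = 0
RIGHT_ACTION = 1
OBS_POLE_ANGVEL_IDX = 3


def counter_angvel_action(observation):
    """Push left for counter-clockwise rotation (negative velocity), otherwise right."""
    if observation[OBS_POLE_ANGVEL_IDX] < 0.0:
        return LEFT_ACTION
    return RIGHT_ACTION


# Made-up observations: [cart position, cart velocity, pole angle, pole angular velocity]
swinging_ccw = [0.0, 0.0, 0.01, -0.3]
swinging_cw = [0.0, 0.0, -0.01, 0.3]
print(counter_angvel_action(swinging_ccw), counter_angvel_action(swinging_cw))
```

Note that the rule ignores the pole angle entirely: in the first sample the pole leans slightly right but is rotating left, and the cart still moves left.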

In [2]:
def counter_angular_velocity():
    with open("./02_cart_pole_counter_angvel.csv", "w") as trace_file:
        print("Episode,Angular Vel,Action", file=trace_file)
        total_reward:float = 0.0
        for episode_idx in range(20):
            observation = env.reset()
            action:int = LEFT_ACTION if observation[OBS_POLE_ANGVEL_IDX] < 0.0 else RIGHT_ACTION
            episode_reward:float = 0.0
            done:bool = False
            step_count:int = 0
            while not done and (step_count < MAX_STEP_COUNT):
                env.render()
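                angvel:float = observation[OBS_POLE_ANGVEL_IDX]
                # Select the action before logging so that each trace row pairs
                # the observed velocity with the action it produced
                if angvel < 0.0:
                    action = LEFT_ACTION
                else:
                    action = RIGHT_ACTION
                print('{:d},{:f},{:s}'.format(episode_idx + 1, angvel, string_from_action(action)), file=trace_file)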
                observation, step_reward, done, info = env.step(action)
                episode_reward += step_reward
                step_count += 1
            print('Episode {:d} Reward: {:f}'.format(episode_idx + 1, episode_reward))
            total_reward += episode_reward
    print('Policy One Average reward: {:f}'.format(total_reward / (episode_idx + 1)))
    env.close()

counter_angular_velocity()

Episode 1 Reward: 160.000000
Episode 2 Reward: 153.000000
Episode 3 Reward: 200.000000
Episode 4 Reward: 200.000000
Episode 5 Reward: 200.000000
Episode 6 Reward: 200.000000
Episode 7 Reward: 200.000000
Episode 8 Reward: 200.000000
Episode 9 Reward: 200.000000
Episode 10 Reward: 200.000000
Episode 11 Reward: 136.000000
Episode 12 Reward: 200.000000
Episode 13 Reward: 138.000000
Episode 14 Reward: 200.000000
Episode 15 Reward: 200.000000
Episode 16 Reward: 200.000000
Episode 17 Reward: 200.000000
Episode 18 Reward: 158.000000
Episode 19 Reward: 200.000000
Episode 20 Reward: 146.000000
Policy One Average reward: 184.550000


## Policy Two: Pole Angular Position
The policy implemented in this section is based on the angle of the pole. If we consider the pole's angular range of motion as a circle with angle $0$ pointing up, whenever the pole tilts into the semicircle defined by the angle range $(\pi,2\pi)$ (left of vertical, a negative angle in the observation), the cart will push to the left. Whenever the pole tilts into the semicircle defined by the angle range $[0,\pi]$ (right of vertical), the cart will push to the right. The expectation is that the pole's angular position will be constrained to a narrow band, thus surviving the requisite 200 time steps.
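
The circle description above and the signed angle tested in the code are two views of the same rule. The sketch below makes the mapping explicit, assuming the observation reports the pole angle as a signed value in radians with $0$ upright and positive values to the right. The helper names `wrap_to_circle` and `action_from_pole_angle` are hypothetical.

```python
import math

LEFT_ACTION = 0
RIGHT_ACTION = 1


def wrap_to_circle(signed_angle: float) -> float:
    """Map a signed angle in radians onto [0, 2*pi), with 0 pointing up."""
    return signed_angle % (2.0 * math.pi)


def action_from_pole_angle(signed_angle: float) -> int:
    """Push left in the semicircle (pi, 2*pi), push right in [0, pi]."""
    if wrap_to_circle(signed_angle) > math.pi:
        return LEFT_ACTION
    return RIGHT_ACTION


# For the small tilts seen in Cart Pole, this matches a plain sign test.
for angle in (-0.2, -0.05, 0.0, 0.05, 0.2):
    sign_rule = LEFT_ACTION if angle < 0.0 else RIGHT_ACTION
    assert action_from_pole_angle(angle) == sign_rule
```

Because an episode ends long before the pole leaves a small band around vertical, the sign test on the observed angle is enough in practice, and the wrap is shown only to connect the prose to the code.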

In [3]:
def pole_angular_position():
    with open("./02_cart_pole_angular_pos.csv", "w") as trace_file:
        print("Episode,Pole Angle,Angular Vel,Action", file=trace_file)
        total_reward:float = 0.0
        for episode_idx in range(20):
            observation = env.reset()
            # Initial action uses the same rule as the loop below (negative angle pushes left)
            action:int = LEFT_ACTION if observation[OBS_POLE_POS_IDX] < 0.0 else RIGHT_ACTION
            episode_reward:float = 0.0
            done:bool = False
            step_count:int = 0
            while not done and (step_count < MAX_STEP_COUNT):
                env.render()
                pole_ang_pos:float = observation[OBS_POLE_POS_IDX]
                if pole_ang_pos < 0.0:
                    action = LEFT_ACTION
                else:
                    action = RIGHT_ACTION
                print(
                    '{:d},{:f},{:f},{:s}'
                    .format(
                        episode_idx + 1,
                        pole_ang_pos,
                        observation[OBS_POLE_ANGVEL_IDX],
                        string_from_action(action)
                    ),
                    file=trace_file
                )
                observation, step_reward, done, info = env.step(action)
                episode_reward += step_reward
                step_count += 1
            print('Episode {:d} Reward: {:f}'.format(episode_idx + 1, episode_reward))
            total_reward += episode_reward
    print('Policy Two Average reward: {:f}'.format(total_reward / (episode_idx + 1)))
    env.close()

pole_angular_position()

Episode 1 Reward: 46.000000
Episode 2 Reward: 38.000000
Episode 3 Reward: 41.000000
Episode 4 Reward: 45.000000
Episode 5 Reward: 39.000000
Episode 6 Reward: 51.000000
Episode 7 Reward: 46.000000
Episode 8 Reward: 45.000000
Episode 9 Reward: 36.000000
Episode 10 Reward: 34.000000
Episode 11 Reward: 52.000000
Episode 12 Reward: 38.000000
Episode 13 Reward: 46.000000
Episode 14 Reward: 36.000000
Episode 15 Reward: 39.000000
Episode 16 Reward: 49.000000
Episode 17 Reward: 42.000000
Episode 18 Reward: 35.000000
Episode 19 Reward: 52.000000
Episode 20 Reward: 51.000000
Policy Two Average reward: 43.050000


## Policy Three: Pole Angular Velocity P-Loop
The policy implemented in this section applies a simplified proportional control feedback loop (P-Loop) to the pole's angular velocity. The P-Loop attempts to keep the pole's angular velocity at $0.0$ by calculating the error between the target velocity ($0.0$ in our case) and the observed velocity. The P-Loop is simplified because, although the signal it produces is proportional to the error, the cart's command input is discrete, so only the sign of the signal selects the action; inside a small dead zone around zero, the previous action is kept. The expectation is that the P-Loop logic will hold the angular velocity near $0.0$ for the requisite 200 time steps.
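
One consequence of the discrete command is worth illustrating: with a positive gain, the gain changes only the width of the dead zone, not the direction chosen outside it. The sketch below restates the logic of the `p_loop_calc` function from the next cell with the gain and dead-zone threshold exposed as parameters. The names `p_loop_action`, `k_p`, `dead_zone`, and `dead_zone_velocity` are hypothetical, and the defaults copy the values used in the notebook.

```python
import math

LEFT_ACTION = 0
RIGHT_ACTION = 1


def p_loop_action(actual, target, current_action, k_p=0.75, dead_zone=0.001):
    """Return the action chosen from the sign of the proportional output."""
    output = (target - actual) * k_p
    if math.fabs(output) < dead_zone:
        return current_action  # Inside the dead zone: keep the previous action
    return LEFT_ACTION if output > 0.0 else RIGHT_ACTION


def dead_zone_velocity(k_p=0.75, dead_zone=0.001):
    """Largest |error| that still falls inside the dead zone."""
    return dead_zone / k_p


# Outside the dead zone, different positive gains pick the same action.
for gain in (0.1, 0.75, 10.0):
    assert p_loop_action(-0.5, 0.0, RIGHT_ACTION, k_p=gain) == LEFT_ACTION
    assert p_loop_action(0.5, 0.0, LEFT_ACTION, k_p=gain) == RIGHT_ACTION
```

With the notebook's values, the dead zone covers angular velocities smaller in magnitude than about $0.0013$, so outside that narrow band this policy selects the same action as Policy One.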

In [4]:
def p_loop_calc(actual, target, current_action):
    K_P:float = 0.75
    error:float = target - actual
    output:float = error * K_P
    # Dead zone
    if math.fabs(output) < 0.001:
        return current_action, error
    # Action based on output
    if output > 0.0:
        return LEFT_ACTION, error
    else:
        return RIGHT_ACTION, error

def pole_ang_vel_p_loop():
    with open("./02_cart_pole_angvel_p_loop.csv", "w") as trace_file:
        print("Episode,Angular Vel,Error,Action", file=trace_file)
        total_reward:float = 0.0
        for episode_idx in range(20):
            observation = env.reset()
            action, _ = p_loop_calc(observation[OBS_POLE_ANGVEL_IDX], 0.0, LEFT_ACTION)
            episode_reward:float = 0.0
            done:bool = False
            step_count:int = 0
            p_loop_err:float = 0.0
            while not done and (step_count < MAX_STEP_COUNT):
                env.render()
                pole_ang_vel:float = observation[OBS_POLE_ANGVEL_IDX]
                action, p_loop_err = p_loop_calc(pole_ang_vel, 0.0, action)
                print(
                    '{:d},{:f},{:f},{:s}'
                    .format(
                        episode_idx + 1,
                        pole_ang_vel,
                        p_loop_err,
                        string_from_action(action)
                    ),
                    file=trace_file
                )
                observation, step_reward, done, info = env.step(action)
                episode_reward += step_reward
                step_count += 1
            print('Episode {:d} Reward: {:f}'.format(episode_idx + 1, episode_reward))
            total_reward += episode_reward
    print('Policy Three Average reward: {:f}'.format(total_reward / (episode_idx + 1)))
    env.close()

pole_ang_vel_p_loop()

Episode 1 Reward: 150.000000
Episode 2 Reward: 200.000000
Episode 3 Reward: 169.000000
Episode 4 Reward: 200.000000
Episode 5 Reward: 153.000000
Episode 6 Reward: 200.000000
Episode 7 Reward: 200.000000
Episode 8 Reward: 144.000000
Episode 9 Reward: 186.000000
Episode 10 Reward: 140.000000
Episode 11 Reward: 200.000000
Episode 12 Reward: 200.000000
Episode 13 Reward: 168.000000
Episode 14 Reward: 148.000000
Episode 15 Reward: 196.000000
Episode 16 Reward: 200.000000
Episode 17 Reward: 197.000000
Episode 18 Reward: 173.000000
Episode 19 Reward: 200.000000
Episode 20 Reward: 182.000000
Policy Three Average reward: 180.300000
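
To compare the three runs side by side, the per-episode rewards printed above can be summarized with a short script. This is a sketch: the `summarize` helper is hypothetical, the reward lists are copied from the outputs in this notebook, and a different run would produce different numbers.

```python
MAX_STEP_COUNT = 200

# Per-episode rewards copied from the outputs printed in this notebook
rewards = {
    "Policy One": [160, 153, 200, 200, 200, 200, 200, 200, 200, 200,
                   136, 200, 138, 200, 200, 200, 200, 158, 200, 146],
    "Policy Two": [46, 38, 41, 45, 39, 51, 46, 45, 36, 34,
                   52, 38, 46, 36, 39, 49, 42, 35, 52, 51],
    "Policy Three": [150, 200, 169, 200, 153, 200, 200, 144, 186, 140,
                     200, 200, 168, 148, 196, 200, 197, 173, 200, 182],
}


def summarize(episode_rewards):
    """Return (average reward, number of episodes that reached the step cap)."""
    average = sum(episode_rewards) / len(episode_rewards)
    full_episodes = sum(1 for r in episode_rewards if r >= MAX_STEP_COUNT)
    return average, full_episodes


for name, episode_rewards in rewards.items():
    average, full_episodes = summarize(episode_rewards)
    print('{:s}: average {:.2f}, {:d} of {:d} episodes reached {:d} steps'.format(
        name, average, full_episodes, len(episode_rewards), MAX_STEP_COUNT))
```

In these runs, the two velocity-based policies reached the 200-step cap in 14 and 8 of 20 episodes respectively, while the angle-based policy never did.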
