# Cart-Pole Fixed Policy #

## Imports

In [1]:
import numpy as np
import gym
import random

## Environment

In [2]:
env = gym.make("CartPole-v1")

## Policy ##

The two main variables that determine how we balance the pole are the angle of the pole and its angular velocity.  If the pole is leaning too far or moving too fast, we must correct quickly.  However, a correction is only clearly needed when the angle and the velocity agree in direction, in other words when they have the same sign: the pole is leaning to one side and still moving further toward that side.  The policy therefore uses both values to decide whether to push left or right.  If the pole is leaning to the right past a certain angle and moving to the right faster than a certain speed, we push the cart to the right so that it moves back under the pole.  If the pole is leaning to the left past that angle and moving to the left faster than that speed, we push the cart to the left.  Otherwise no push is needed, but since the environment has no idle action, we randomly choose to push either left or right.

The threshold values were selected by trial and error.  Originally, since the speed could theoretically be unbounded and the angle could be at most 24 degrees in either direction, I chose whole-integer thresholds.  However, I found that the cart-pole is very sensitive to even the smallest changes, particularly in the pole angle, so I selected much smaller values.  The threshold for the pole velocity is twice the threshold for the angle, reflecting that we are allowed more leeway when it comes to speed.
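To put the thresholds in perspective, the short sketch below converts between degrees and radians. It assumes that the environment reports the pole angle in radians and the pole velocity in radians per second; the constant names are introduced here for illustration only and do not appear in the notebook.

```python
import math

# Assumption: the pole angle is reported in radians.
OBSERVATION_LIMIT_DEG = 24.0  # observable angle range quoted in the text
ANGLE_THRESHOLD_RAD = 0.001   # angle threshold used by the policy
VELOCITY_THRESHOLD = 0.002    # velocity threshold, numerically twice the angle one

# Convert the quoted range to radians and the threshold to degrees.
observation_limit_rad = math.radians(OBSERVATION_LIMIT_DEG)
angle_threshold_deg = math.degrees(ANGLE_THRESHOLD_RAD)

# Fraction of the observable range covered by the angle threshold.
threshold_fraction = ANGLE_THRESHOLD_RAD / observation_limit_rad

print(f"24 degrees is about {observation_limit_rad:.3f} rad")
print(f"0.001 rad is about {angle_threshold_deg:.4f} degrees")
print(f"angle threshold / observable range = {threshold_fraction:.5f}")
```

Under that assumption, the angle threshold is a very small fraction of the observable range, which matches the observation that whole-integer thresholds were far too coarse.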

In [3]:
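def get_action(pole_angle, pole_velocity):
    # Action 1 pushes the cart right, action 0 pushes it left.
    if pole_velocity > 0.002 and pole_angle > 0.001:
        # Leaning right and still falling right.
        return 1
    elif pole_velocity < -0.002 and pole_angle < -0.001:
        # Leaning left and still falling left.
        return 0
    else:
        # No idle action exists, so pick a direction at random.
        return random.randint(0, 1)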
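As a quick sanity check of the rule, the sketch below applies the same thresholds to a few hand-picked states and reports which branch fires. The helper `describe_action`, the sample states, and the seeded generator are hypothetical additions for illustration, not part of the original policy.

```python
import random

ANGLE_THRESHOLD = 0.001
VELOCITY_THRESHOLD = 0.002


def describe_action(pole_angle, pole_velocity, rng):
    """Return (action, reason) using the same thresholds as the policy."""
    if pole_velocity > VELOCITY_THRESHOLD and pole_angle > ANGLE_THRESHOLD:
        return 1, "falling right, push right"
    if pole_velocity < -VELOCITY_THRESHOLD and pole_angle < -ANGLE_THRESHOLD:
        return 0, "falling left, push left"
    # Signs disagree or a value is inside its threshold: pick at random.
    return rng.randint(0, 1), "no clear fall, random push"


rng = random.Random(0)  # seeded only so that repeated runs match

# Hypothetical (angle, velocity) pairs chosen to hit each branch.
samples = [
    (0.05, 0.3),    # leaning right and moving right
    (-0.05, -0.3),  # leaning left and moving left
    (0.05, -0.3),   # leaning right but swinging back to the left
    (0.0005, 0.5),  # moving right, but angle below its threshold
]

results = [describe_action(angle, velocity, rng) for angle, velocity in samples]
for (angle, velocity), (action, reason) in zip(samples, results):
    print(f"angle={angle:+.4f} velocity={velocity:+.2f} -> {action} ({reason})")
```

The last two samples show the cases the prose describes as not needing a push: the signs disagree, or one value is still within its threshold.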

## Testing ##

In [4]:
episodes = 100
steps = 200  # cap on the number of steps per episode

reward_list = []
for episode in range(1, episodes + 1):
    # Note: this loop uses the older Gym API, where reset() returns only the
    # observation and step() returns a 4-tuple.
    state = env.reset()
    total_reward = 0
    # Observation indices 2 and 3 are the pole angle and pole angular velocity.
    cur_action = get_action(state[2], state[3])
    for step in range(1, steps + 1):
        state, reward, done, _ = env.step(cur_action)
        total_reward += reward
        if done:
            break
        cur_action = get_action(state[2], state[3])
        #env.render()
    reward_list.append(total_reward)

env.close()
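The episode loop can also be factored into a reusable function. The sketch below is one possible shape, exercised against a hypothetical `ToyEnv` stand-in so that it runs without Gym installed; `ToyEnv`, `run_episode`, and `fixed_policy` are names invented here. It assumes the older Gym API (`reset()` returns the observation, `step()` returns a 4-tuple); newer Gym and Gymnasium releases return `(observation, info)` from `reset()` and a 5-tuple from `step()`, so the unpacking would need adjusting there.

```python
import random


def fixed_policy(pole_angle, pole_velocity):
    """Same threshold rule as the notebook's policy."""
    if pole_velocity > 0.002 and pole_angle > 0.001:
        return 1
    if pole_velocity < -0.002 and pole_angle < -0.001:
        return 0
    return random.randint(0, 1)


class ToyEnv:
    """Hypothetical stand-in for a Gym environment (old 4-tuple step API).

    It ignores the action and simply ends the episode after `horizon` steps.
    """

    def __init__(self, horizon=50):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.01, 0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return [0.0, 0.0, 0.01, 0.0], 1.0, done, {}


def run_episode(env, policy, max_steps=200):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Indices 2 and 3 hold the pole angle and pole angular velocity.
        action = policy(state[2], state[3])
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


rewards = [run_episode(ToyEnv(), fixed_policy) for _ in range(5)]
capped = run_episode(ToyEnv(horizon=500), fixed_policy, max_steps=200)
print(rewards, capped)
```

Passing the environment and policy as arguments keeps the loop independent of any particular threshold values, which makes it easier to compare alternative policies.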

## Results ##

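Based on the OpenAI Gym description, CartPole-v0 is considered solved when the average reward over 100 consecutive episodes is greater than or equal to 195.  Although this notebook creates CartPole-v1, each episode above is capped at 200 steps, so the results are compared against that 195 threshold.  Since the environment gives a reward of 1 per timestep, the total reward equals the number of timesteps the pole stayed balanced.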

In [5]:
print("Max/Avg/Min timesteps the pole was balanced for:")
print(np.max(reward_list), np.average(reward_list), np.min(reward_list))

Max/Avg/Min timesteps the pole was balanced for:
200.0 200.0 200.0
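The solved criterion described above can be written as a small check. This is a minimal sketch: the function name `is_solved` and its default arguments are assumptions chosen to mirror the 195-over-100-episodes rule, and the reward lists are made-up inputs rather than results from the notebook.

```python
def is_solved(rewards, threshold=195.0, window=100):
    """Return True if the mean of the last `window` rewards meets the threshold."""
    if len(rewards) < window:
        # Not enough episodes yet to evaluate the criterion.
        return False
    recent = rewards[-window:]
    return sum(recent) / window >= threshold


# Made-up reward lists used only to exercise the function.
print(is_solved([200.0] * 100))
print(is_solved([190.0] * 100))
print(is_solved([200.0] * 99))
```

In the notebook, the same check could be applied to `reward_list` once all 100 episodes have finished.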
