# CartPole-v0
"A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center."


The sections below describe MountainCar-v0, from https://github.com/openai/gym/wiki/MountainCar-v0#solved-requirements :

## Observation
Type: Box(2)
| Num | Observation | Min   | Max  |
|-----|-------------|-------|------|
| 0   | position    | -1.2  | 0.6  |
| 1   | velocity    | -0.07 | 0.07 |

## Action
Type: Discrete(3)
| Num | Action     |
|-----|------------|
| 0   | push left  |
| 1   | no push    |
| 2   | push right |

## Reward
-1 for each time step, until the goal position of 0.5 is reached. As with MountainCarContinuous v0, there is no penalty for climbing the left hill, which acts as a wall once reached.

## Starting State
Random position from -0.6 to -0.4 with no velocity.

## Episode Termination
The episode ends when the car reaches position 0.5, or after 200 steps.

## Solved Requirements
MountainCar-v0 defines "solving" as getting an average reward of -110.0 over 100 consecutive trials.
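
The solved requirement can be checked against a list of per-episode total rewards. The following is a minimal sketch, not part of the gym API: the function name `is_solved` is hypothetical, and it assumes the requirement means an average of *at least* -110.0 over some window of 100 consecutive episodes.

```python
from collections import deque

SOLVED_THRESHOLD = -110.0  # average reward quoted in the requirement above
WINDOW = 100               # number of consecutive trials


def is_solved(episode_rewards, threshold=SOLVED_THRESHOLD, window=WINDOW):
    """Return True if any `window` consecutive episodes average >= threshold.

    Assumption: "getting average reward of -110.0" is read as "at least -110.0".
    """
    recent = deque(maxlen=window)  # sliding window of the latest rewards
    running = 0.0                  # sum of the rewards currently in the window
    for reward in episode_rewards:
        if len(recent) == window:
            running -= recent[0]   # this value is dropped by the append below
        recent.append(reward)
        running += reward
        if len(recent) == window and running / window >= threshold:
            return True
    return False
```

An agent that always times out at 200 steps collects -200 per episode, so `is_solved([-200] * 100)` is `False`, while 100 consecutive episodes of -105 satisfy the check.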

https://gym.openai.com/envs/CartPole-v0/ &  https://gym.openai.com/docs/ :

Loop: At each timestep, the agent chooses an action and the environment returns an observation and a reward. The process starts by calling reset(), which returns an initial observation:
    "on each episode, and at every step, an action is chosen based on the current state and a policy based on the action-value function Q. After taking the action, we receive a reward and arrive at the next state. This information is used to update the action-value function, and, after doing so, we make the next state our current state and follow through until we reach the final state of the final episode." - https://medium.com/@flomay/using-q-learning-to-solve-the-cartpole-balancing-problem-c0a7f47d3f9d

**Step** returns 4 values:
1. Observation is an environment-specific object representing your observation of the environment, e.g. pixel data from a camera or the board state in a board game.
2. Reward is a float. It is the amount of reward achieved by the prior action. The goal is to increase total reward. The reward scale varies between environments.
3. Done is a boolean. Whether it is time to reset the environment. Most tasks are divided into episodes which terminate when done.
4. Info (dict) provides diagnostic information for debugging.

Every **environment** comes with an action_space and an observation_space, which describe the format of valid actions and observations. The Discrete space allows a fixed range of non-negative numbers, e.g. in CartPole valid **actions** are either 0 or 1. The Box space represents an n-dimensional box, so valid CartPole **observations** will be an array of 4 numbers.

In CartPole, one of the actions applies force to the left (action 0) and the other applies force to the right (action 1).

"Since this [Q learning] algorithm relies on updating a function for each existing pair of state and action, environments that have a high state-space become problematic. This is because we can approximate better the actual value of a state-action pair as we visit it more often. However, if we have many states or many actions to take, we distribute our visits among more pairs and it takes much longer to converge to the actual true values. " - https://medium.com/@flomay/using-q-learning-to-solve-the-cartpole-balancing-problem-c0a7f47d3f9d 
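
A common workaround for a continuous observation like CartPole's is to map each of the 4 values into a small number of buckets, so the Q table only has to cover a handful of discrete states. The sketch below is illustrative only: the function name `discretize`, the clipping bounds, and the bucket counts are all assumptions chosen for the example, not values taken from the quoted article or from gym.

```python
# Hypothetical clipping bounds for (position, velocity, angle, angular velocity).
# The velocities are unbounded in the environment, so they are clipped here.
LOWER = [-2.4, -3.0, -0.21, -3.5]
UPPER = [2.4, 3.0, 0.21, 3.5]

# Hypothetical bucket counts per dimension: 1 * 1 * 6 * 3 = 18 discrete states.
BUCKETS = [1, 1, 6, 3]


def discretize(observation, lower=LOWER, upper=UPPER, buckets=BUCKETS):
    """Map a continuous observation to a tuple of bucket indices."""
    state = []
    for value, lo, hi, n in zip(observation, lower, upper, buckets):
        clipped = min(max(value, lo), hi)    # keep the value inside [lo, hi]
        ratio = (clipped - lo) / (hi - lo)   # rescale to [0, 1]
        index = min(int(ratio * n), n - 1)   # ratio == 1.0 falls in the last bucket
        state.append(index)
    return tuple(state)
```

With fewer buckets, each state-action pair is visited more often and its value estimate converges faster, at the cost of treating observations in the same bucket as identical.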

The action-value function must be updated at every step of learning. Only the value for the pair of state visited and action taken is updated at each step.
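
The per-step update can be sketched with the standard tabular Q-learning rule, Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', ·) − Q(s, a)). This is a minimal sketch under assumptions: the helper names `make_q_table` and `q_update` are hypothetical, states are assumed to be hashable (e.g. tuples of bucket indices), and the default learning rate `alpha` and discount `gamma` are placeholder values, not tuned ones.

```python
from collections import defaultdict

N_ACTIONS = 2  # CartPole has two actions: 0 and 1


def make_q_table(n_actions=N_ACTIONS):
    """Q table mapping state -> list of action values, initialised to zero."""
    return defaultdict(lambda: [0.0] * n_actions)


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update for the visited (state, action) pair.

    alpha and gamma are assumed placeholder values.
    """
    # A terminal next state contributes no future value.
    best_next = 0.0 if done else max(q[next_state])
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

In a training loop, `q_update` would be called once after every `env.step(action)`, after which the next state becomes the current state.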
 
See also: https://medium.com/swlh/using-q-learning-for-openais-cartpole-v1-4a216ef237df
Read https://medium.com/@flomay/using-q-learning-to-solve-the-cartpole-balancing-problem-c0a7f47d3f9d with https://github.com/JoeSnow7/Reinforcement-Learning/blob/master/Cartpole%20Q-learning.ipynb

In [6]:
import numpy as np #for array manipulation
import gym # pull the cart pole environment from Open AI.
#import random
import time
import math
#from IPython.display import clear_output


import matplotlib.pyplot as plt
import gc
gc.disable() #Disable automatic garbage collection.

In [7]:
#Create environment
import gym
env = gym.make('CartPole-v0') 
# Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations:
print(env.action_space) #> Discrete(2)
print(env.observation_space) #> a Box of shape (4,)
#The Discrete space allows a fixed range of non-negative numbers, so in this case valid actions are either 0 or 1. The Box space represents an n-dimensional box, so valid observations will be an array of 4 numbers. We can also check the Box’s bounds:
#Note: the position and angle bounds (4.8 and ~0.419 rad) are twice the episode-termination thresholds; the velocity bounds are the float32 maximum.
print(env.observation_space.high)  #> [ 4.8, 3.4e+38, 0.419, 3.4e+38]
print(env.observation_space.low)  #> [-4.8, -3.4e+38, -0.419, -3.4e+38]

#You can sample from a Space or check that something belongs to it:
# from gym import spaces
# space = spaces.Discrete(8) # Set with 8 elements {0, 1, 2, ..., 7}
# x = space.sample()
# assert space.contains(x)
# assert space.n == 8


Discrete(2)
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]


In [8]:
#This will run an instance of the CartPole-v0 environment for however many timesteps, rendering the environment at each step.
#The process gets started by calling reset(), which returns an initial observation. 
for episode in range(20): #number of episodes
    observation = env.reset()
    for t in range(100): #amount of time
        env.render()
        print(observation)
        action = env.action_space.sample() # take a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

[ 0.00146761 -0.01781747  0.01210361  0.01178702]
[ 0.00111126  0.17712883  0.01233935 -0.27705264]
[ 0.00465384 -0.01816697  0.0067983   0.01949644]
[ 0.0042905   0.17685683  0.00718823 -0.27103382]
[ 0.00782764 -0.01836695  0.00176755  0.02390761]
[ 0.0074603   0.1767296   0.0022457  -0.26821711]
[ 0.01099489  0.37181943 -0.00311864 -0.56019088]
[ 0.01843128  0.56698502 -0.01432246 -0.85385471]
[ 0.02977098  0.76229923 -0.03139955 -1.15100664]
[ 0.04501697  0.95781652 -0.05441969 -1.45336796]
[ 0.0641733   0.76340348 -0.08348704 -1.17817161]
[ 0.07944137  0.9595045  -0.10705048 -1.49581528]
[ 0.09863146  1.15575284 -0.13696678 -1.81991544]
[ 0.12174651  0.96239328 -0.17336509 -1.57273486]
[ 0.14099438  0.76971137 -0.20481979 -1.33875888]
Episode finished after 15 timesteps
[ 0.03988389 -0.03109785 -0.04258748  0.01306538]
[ 0.03926193 -0.22558399 -0.04232618  0.29201313]
[ 0.03475025 -0.42007769 -0.03648591  0.57105221]
[ 0.0263487  -0.22446359 -0.02506487  0.26710186]
[ 0.02185942 -