# CartPole Using Q-Learning

[CartPole Documentation](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

### Action Space

The action is an `ndarray` with shape `(1,)` which can take values `{0, 1}` indicating the direction of the fixed force the cart is pushed with.

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

**Note:** The change in velocity produced by the applied force is not fixed; it depends on the angle the pole is pointing. The pole's center of gravity changes the amount of energy needed to move the cart underneath it.

### Observation Space

The observation is an `ndarray` with shape `(4,)` with the values corresponding to the following positions and velocities:

| Num | Observation           | Min                 | Max               |
|-----|-----------------------|---------------------|-------------------|
| 0   | Cart Position         | -4.8                | 4.8               |
| 1   | Cart Velocity         | -Inf                | Inf               |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf                | Inf               |

**Note:** While the ranges above denote the possible values of each element of the observation space, they do not reflect the values the state can take in an unterminated episode. In particular:

- The cart x-position (index 0) can take values in `(-4.8, 4.8)`, but the episode terminates if the cart leaves the `(-2.4, 2.4)` range.
- The pole angle (index 2) can be observed in `(-.418, .418)` radians (or **±24°**), but the episode terminates if the pole angle is not in the range `(-.2095, .2095)` radians (or **±12°**).
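
The two termination conditions above can be written as a small check. This is a minimal sketch based only on the thresholds quoted in this section; `is_terminal`, `X_LIMIT`, and `THETA_LIMIT` are illustrative names, not part of gym.

```python
import math

# Thresholds quoted in the text above (assumed, not read from the env itself).
X_LIMIT = 2.4                    # cart position limit
THETA_LIMIT = math.radians(12)   # about 0.2095 rad


def is_terminal(observation):
    """Return True if the cart or the pole has left its allowed range."""
    x, _x_dot, theta, _theta_dot = observation
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT


# An upright pole on a centered cart is a valid, non-terminal state.
print(is_terminal((0.0, 0.0, 0.0, 0.0)))
# A cart at x = 2.5 is still inside the observation space but ends the episode.
print(is_terminal((2.5, 0.0, 0.0, 0.0)))
```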


## Q-Learning Table

| Num | Start | Step Width | Steps (incl. 0) |
|-----|-------|------------|-----------------|
| 0   | -2.4  | 0.1        | 48 + 1          |
| 1   | -4    | 0.01       | 800 + 1         |
| 2   | -12°  | 1°         | 24 + 1          |
| 3   | -4    | 0.01       | 800 + 1         |
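
One way to read the table above is as a mapping from a continuous value to a bin index: subtract the start, divide by the step width, round, and clamp to the valid range. The sketch below assumes that reading; `to_index` and its parameter names are illustrative and do not come from the notebook.

```python
import math


def to_index(value, start, step_width, num_steps):
    """Map a continuous value to a bin index in [0, num_steps - 1]."""
    index = int(round((value - start) / step_width))
    # Clamp so values outside the tabulated range land in the edge bins.
    return max(0, min(num_steps - 1, index))


# Row 0: cart position, start -2.4, step 0.1, 49 bins.
print(to_index(0.0, -2.4, 0.1, 49))      # → 24
# Row 2: pole angle, converted to degrees first, start -12°, step 1°, 25 bins.
print(to_index(math.degrees(0.1), -12, 1, 25))  # → 18
```

Clamping is a design choice for this sketch: the velocities are unbounded, so without it a fast-moving cart would index past the end of its row.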


In [9]:
import gym
import math
import numpy as np

## Testing with all 4 observation spaces

- Not actually necessary
- Only a q-table over the angles the pole can be at is truly needed

In [10]:
""" Searching for ranges of observation space """

"""
env = gym.make("CartPole-v1")

min_vals = [0, 0, 0, 0]
max_vals = [0, 0, 0, 0]

episodes = 100000
for i_ep in range(episodes):
    # print("Episode:", i_ep + 1)

    obs = env.reset()
    done = False

    while not done:
        # env.render()

        action = env.action_space.sample()

        obs, reward, done, _ = env.step(action)
        
        if done: 
            break
        
        for i in range(len(obs)):
            min_vals[i] = min(obs[i], min_vals[i])
            max_vals[i] = max(obs[i], max_vals[i])

print("Min:", min_vals)
print("Max:", max_vals)
"""


In [11]:
"""
q_table_params = {
    "start": [-2.4, -4, -12, -4],
    "step": [1, 2, 0, 2], # In digits after the decimal point
    "steps": [49, 801, 25, 801],
}

q_table = np.array(
    [np.zeros((steps_val,)) for steps_val in q_table_params["steps"]], dtype="object"
)
"""


In [12]:
""" Check to make sure that all continuous observations have translated to discrete spaces """

"""
env = gym.make("CartPole-v1")

episodes = 10000
for i_ep in range(episodes):
    obs = env.reset()
    done = False

    while not done:
        action = env.action_space.sample()

        obs, reward, done, _ = env.step(action)
        
        if done:
            break # Skips adding last observation to q-table

        for i in range(len(obs)):
            start = q_table_params["start"][i]
            step = q_table_params["step"][i]  # digits after the decimal point
            power = 10 ** step

            # The pole angle (index 2) is binned in degrees, the rest in raw units
            val = math.degrees(obs[i]) if i == 2 else obs[i]

            # Offset by the range start so the lowest value maps to index 0
            index = int(round((val - start) * power))
            q_table[i][index] += 1

print(q_table[0])
"""

## Q-Learning Agent Implementation

In [27]:
ALPHA = 0.5 # Learning rate -> how strongly each update moves a q-table entry toward its new estimate
GAMMA = 0.9 # Discount factor -> how much rewards from future steps are discounted
EPSILON = 0.1 # Exploration rate -> chance that the agent takes a random action instead of the greedy one

REWARD_SHAPING = -1000

ENV_NAME = "CartPole-v1"
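
To make the role of `EPSILON` concrete, here is a minimal epsilon-greedy sketch. `choose_action` and its arguments are hypothetical names used for illustration; `q_row` stands for one row of a q-table, holding one value per action.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index from one q-table row using epsilon-greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: pick the action with the highest stored value (ties go to the first).
    return max(range(len(q_row)), key=lambda a: q_row[a])


# With epsilon = 0 the choice is always greedy.
print(choose_action([0.2, 0.7], epsilon=0.0))  # → 1
```

With `EPSILON = 0.1`, roughly one action in ten would be random and the rest greedy.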

[Generating Docstrings](https://queirozf.com/entries/python-docstrings-reference-examples#:~:text=Python%20Docstrings%3A%20Reference%20%26%20Examples%201%20ReStructuredText%20%28reST%29,description%20of%20function.%20...%204%20Doctest%20Permalink.%20)

In [24]:
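def discrete(radians):
    """Convert a pole angle in radians to whole degrees."""
    return int(round(math.degrees(radians)))

def greedy():
    """Return True when the agent should exploit (probability 1 - EPSILON)."""
    return np.random.rand() > EPSILON

def q_func(reward, q_current_value, q_forward_value):
    """Return the increment to add to the current q-table entry (ALPHA times the TD error)."""

    q_delta = ALPHA * (reward + GAMMA * q_forward_value - q_current_value)
    return q_delta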
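
For reference, the standard tabular Q-learning update adds `ALPHA` times the temporal-difference error to the current entry, using the best value of the next state as the forward estimate. The sketch below is an assumption about how the update could be applied to a table, not the notebook's code; `q_update` is a hypothetical helper, and the table is a plain list of rows so the example has no dependencies.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the new value."""
    # A terminal transition has no future reward to bootstrap from.
    best_next = 0.0 if done else max(q_table[next_state])
    td_error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


q = [[0.0, 0.0], [1.0, 2.0]]
# reward 1, best next value 2: 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4
print(q_update(q, state=0, action=1, reward=1.0, next_state=1))
```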

In [31]:
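class QAgent:
    """Tabular Q-learning agent that only looks at the pole angle."""

    def __init__(self):
        # 29 rows: assumed to be one per whole degree from -14° to 14°; one column per action
        self.q_table = np.zeros((29, 2))

        self.env = None
        self.make()

    def _forward(self, observation):
        """Return the greedy action for the observed pole angle."""
        degrees = int(round(math.degrees(observation[2])))
        # Shift by 14 so -14° maps to row 0, and clip stray angles into range
        index = int(np.clip(degrees + 14, 0, 28))
        return int(np.argmax(self.q_table[index]))

    def evaluate(self, render=False):
        env = self.env

        obs = env.reset()
        done = False

        for i in range(200):
            if render:
                env.render()

            action = self._forward(obs)

            obs, rew, done, _ = env.step(action)

            if done:
                break  # Do not keep stepping a finished episode

        print(done)

    def make(self):
        if self.env is None:
            self.env = gym.make(ENV_NAME)

    def close(self):
        if self.env is not None:
            self.env.close()
            self.env = None


In [32]:
a = QAgent()
a.evaluate()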
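
The pieces above (angle discretization, epsilon-greedy choice, and the Q-value update) can be combined into a training loop. This is a minimal sketch, not the notebook's implementation: `angle_index`, `train`, and the use of `terminal_reward` as the value backed up on the final step are assumptions, and `env` is assumed to follow the older gym API used in this notebook (`reset()` returns the observation, `step()` returns a 4-tuple). The table is a plain list of `[left, right]` rows so the sketch has no dependencies.

```python
import math
import random

N_BINS = 29            # matches the (29, 2) table shape used by QAgent
OFFSET = N_BINS // 2   # assumed: shifts -14°..14° onto indices 0..28


def angle_index(observation):
    """Map the pole angle (observation index 2, radians) to a table row."""
    degrees = int(round(math.degrees(observation[2])))
    return max(0, min(N_BINS - 1, degrees + OFFSET))


def train(env, q_table, episodes, alpha=0.5, gamma=0.9, epsilon=0.1,
          terminal_reward=-1000, max_steps=200, rng=random):
    """Run tabular Q-learning on an env with the 4-tuple step API."""
    for _ in range(episodes):
        state = angle_index(env.reset())
        for _ in range(max_steps):
            # Epsilon-greedy action selection over the two actions.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q_table[state][0] >= q_table[state][1] else 1

            obs, reward, done, _info = env.step(action)
            next_state = angle_index(obs)

            # Assumption: a finished episode backs up the shaped penalty only.
            if done:
                target = terminal_reward
            else:
                target = reward + gamma * max(q_table[next_state])

            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
            if done:
                break
    return q_table
```

Passing the environment created by `gym.make(ENV_NAME)` as `env`, together with a table such as `[[0.0, 0.0] for _ in range(N_BINS)]`, should work with gym versions that expose this older API; newer releases return extra values from `reset()` and `step()` and would need a small adapter.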