# CartPole Using Q-Learning

[CartPole Documentation](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py)

### Action Space

The action is an `ndarray` with shape `(1,)` that takes values in `{0, 1}`, indicating the direction of the fixed force with which the cart is pushed.

| Num | Action                 |
|-----|------------------------|
| 0   | Push cart to the left  |
| 1   | Push cart to the right |

**Note**: The amount by which the applied force reduces or increases the cart's velocity is not fixed; it depends on the angle at which the pole is pointing, because the pole's center of gravity changes the amount of energy needed to move the cart underneath it.
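The angle dependence in the note can be seen in the equations of motion. Below is a minimal sketch of one explicit-Euler step, assuming the constants and update equations used in gym's `cartpole.py` (gravity 9.8, cart mass 1.0, pole mass 0.1, half-pole length 0.5, force magnitude 10.0, 0.02 s per step); these values are assumptions to check against the linked source, and `euler_step` is a hypothetical helper, not part of gym's API.

```python
import math

# Constants ASSUMED to match gym's cartpole.py; verify against the linked source
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5  # half the pole's length
POLEMASS_LENGTH = MASS_POLE * HALF_LENGTH
FORCE_MAG = 10.0   # magnitude of the fixed push
TAU = 0.02         # seconds per step (explicit Euler)


def euler_step(state, action):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step (hypothetical helper)."""
    x, x_dot, theta, theta_dot = state
    # Action 0 pushes left, action 1 pushes right, always with the same magnitude
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    # The cart's acceleration depends on the pole angle through theta_acc and cos_t
    x_acc = temp - POLEMASS_LENGTH * theta_acc * cos_t / TOTAL_MASS

    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


upright = euler_step((0.0, 0.0, 0.0, 0.0), 1)
tilted = euler_step((0.0, 0.0, 0.2, 0.0), 1)
# Same push, different change in cart velocity
print(upright[1], tilted[1])
```

Starting from rest, the same rightward push gives the upright cart a slightly larger velocity change than the cart whose pole is tilted by 0.2 rad.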

### Observation Space

The observation is an `ndarray` with shape `(4,)` whose values correspond to the following positions and velocities:

| Num | Observation           | Min                 | Max               |
|-----|-----------------------|---------------------|-------------------|
| 0   | Cart Position         | -4.8                | 4.8               |
| 1   | Cart Velocity         | -Inf                | Inf               |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf                | Inf               |

**Note:** While the ranges above denote the possible values of each element of the observation space, they do not reflect the values allowed in an unterminated episode. In particular:

- The cart x-position (index 0) can take values in `(-4.8, 4.8)`, but the episode terminates if the cart leaves the `(-2.4, 2.4)` range.
- The pole angle (index 2) can be observed in `(-.418, .418)` radians (or **±24°**), but the episode terminates if the pole angle is not in the range `(-.2094, .2094)` (or **±12°**).
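The two termination conditions above can be written as a small check. This is a minimal sketch; `is_terminal` is a hypothetical helper rather than part of gym's API, and 12° is converted to radians with `math.radians` (about 0.2094 rad).

```python
import math

# Termination thresholds from the note above
X_THRESHOLD = 2.4                   # cart position
THETA_THRESHOLD = math.radians(12)  # pole angle, about 0.2094 rad


def is_terminal(observation):
    """Return True if the observation ends the episode (hypothetical helper)."""
    x, _, theta, _ = observation
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


print(is_terminal((0.0, 0.0, 0.0, 0.0)))  # False
print(is_terminal((2.5, 0.0, 0.0, 0.0)))  # True
```

The observation bounds in the table (±4.8 and about ±0.418 rad) are twice these thresholds, so observations that are valid for the observation space can still belong to a terminated episode.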


## Q-Learning Table

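The continuous observations are discretized onto an evenly spaced grid so that each state can index a Q-table. Each grid runs from *Start* to *End* in increments of *Step Width*; the last column counts the grid points, i.e. the number of steps plus the starting point.

| Num | Observation           | Start | End | Step Width | Grid Points (Steps + 1) |
|-----|-----------------------|-------|-----|------------|-------------------------|
| 0   | Cart Position         | -2.4  | 2.4 | 0.1        | 49 (48 + 1)             |
| 1   | Cart Velocity         | -4    | 4   | 0.1        | 81 (80 + 1)             |
| 2   | Pole Angle            | -12°  | 12° | 1°         | 25 (24 + 1)             |
| 3   | Pole Angular Velocity | -4    | 4   | 0.1        | 81 (80 + 1)             |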
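A minimal sketch of how this grid could map an observation to table indices, assuming nearest-grid-point rounding and clipping of out-of-range values (the table itself does not specify either). It also computes how many Q-values the full table would hold. `GRID`, `to_indices`, and `N_ACTIONS` are hypothetical names.

```python
import math

# (start, step width, grid points) per observation, from the table above.
# The angle row is converted to radians because CartPole reports radians.
GRID = [
    (-2.4, 0.1, 49),                           # cart position
    (-4.0, 0.1, 81),                           # cart velocity
    (math.radians(-12), math.radians(1), 25),  # pole angle
    (-4.0, 0.1, 81),                           # pole angular velocity
]
N_ACTIONS = 2


def to_indices(observation):
    """Snap each value to its nearest grid point, clipped to the grid (assumption)."""
    indices = []
    for value, (start, width, points) in zip(observation, GRID):
        i = round((value - start) / width)
        indices.append(min(max(i, 0), points - 1))
    return tuple(indices)


# Number of Q-values in the full table: one per grid cell and action
size = N_ACTIONS
for _, _, points in GRID:
    size *= points

print(to_indices((0.0, 0.0, 0.0, 0.0)))  # (24, 40, 12, 40)
print(size, f"{size * 8 / 1e6:.1f} MB as float64")  # 16074450 128.6 MB as float64
```

At roughly 16 million entries, most cells of this table will rarely or never be visited in 1,000 episodes; a coarser grid is a common way to trade resolution for faster learning.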


```python
import gym
import numpy as np
import matplotlib.pyplot as plt
```

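```python
def greedy(epsilon):
    """Return True (exploit) with probability 1 - epsilon, False (explore) otherwise."""
    return np.random.rand() > epsilon


def q_func(reward, q_current_value, q_forward_value):
    """Return the updated Q-value for the current state-action pair.

    Uses the temporal-difference update with the global ALPHA (learning rate)
    and GAMMA (discount factor) defined in the training cell.
    """
    q_value = q_current_value + ALPHA * (reward + GAMMA * q_forward_value - q_current_value)
    return q_value


def display(values):
    """Plot the score of each training episode."""
    plt.plot(values)

    plt.title("Training Data")
    plt.xlabel("Episodes")
    plt.ylabel("Score")

    plt.show()
```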

```python
class QTable:
    pass
```

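```python
EPISODES = 1000
ALPHA = 0.5
GAMMA = 0.9
EPSILON = 1.0
EPSILON_DECAY = 0.999
EPSILON_MIN = 0.0001
ENV_NAME = "CartPole-v1"

# Discretization grid from the Q-Learning Table: (start, step width, grid points).
# The pole angle is observed in radians, so its 1° step is converted here.
BINS = [
    (-2.4, 0.1, 49),                       # cart position
    (-4.0, 0.1, 81),                       # cart velocity
    (np.radians(-12), np.radians(1), 25),  # pole angle
    (-4.0, 0.1, 81),                       # pole angular velocity
]


def discretize(observation):
    """Map a continuous observation to a tuple of Q-table indices."""
    indices = []
    for value, (start, width, n_bins) in zip(observation, BINS):
        index = int(round((value - start) / width))
        # Values outside the grid are clipped to the first/last grid point
        indices.append(min(max(index, 0), n_bins - 1))
    return tuple(indices)


class QAgent:
    def __init__(self, agent_type="QLEARNING", env_name=ENV_NAME):
        self.env = gym.make(env_name)
        self.agent_type = agent_type

        self.rewards = []
        # CartPole's observation space is continuous (a Box) and has no `.n`,
        # so the table is indexed by the discretized state plus the action.
        n_bins = tuple(n for _, _, n in BINS)
        self.q_table = np.zeros(n_bins + (self.env.action_space.n,))
        self.epsilon = EPSILON

    def evaluate(self, episodes=EPISODES):
        env = self.env
        q_table = self.q_table
        epsilon = self.epsilon

        for _ in range(episodes):
            # Classic gym (< 0.26) API: reset() returns only the observation.
            # gym >= 0.26 returns (observation, info) instead.
            state = discretize(env.reset())
            done = False
            score = 0.0

            while not done:
                if not greedy(epsilon):
                    action = env.action_space.sample()       # explore
                else:
                    action = int(np.argmax(q_table[state]))  # exploit

                # Classic gym API: (observation, reward, done, info).
                # gym >= 0.26 returns (observation, reward, terminated, truncated, info).
                observation, reward, done, _ = env.step(action)
                forward_state = discretize(observation)
                score += reward

                q_current = q_table[state + (action,)]
                if done:
                    # Do not bootstrap past the end of an episode (classic gym
                    # also sets done when CartPole-v1's 500-step limit is hit)
                    q_forward = 0.0
                elif self.agent_type == "QLEARNING":
                    q_forward = np.max(q_table[forward_state])
                else:
                    # Average over actions: the value under a uniformly random next action
                    q_forward = np.average(q_table[forward_state])

                q_table[state + (action,)] = q_func(reward, q_current, q_forward)
                state = forward_state

            # Record the episode's total reward, not just the reward of its last step
            self.rewards.append(score)
            epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)

        self.epsilon = epsilon
        return self.rewards
```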
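The exploration schedule follows from the constants above: ε is multiplied by `EPSILON_DECAY` once per episode and floored at `EPSILON_MIN`. A short calculation with the same constant values shows how much exploration remains after `EPISODES` episodes and when the floor would be reached.

```python
import math

# Same values as the training cell
EPSILON = 1.0
EPSILON_DECAY = 0.999
EPSILON_MIN = 0.0001
EPISODES = 1000

# Epsilon after the last episode (decay is applied once per episode)
final_epsilon = max(EPSILON_MIN, EPSILON * EPSILON_DECAY ** EPISODES)

# Episodes needed before the EPSILON_MIN floor is reached
episodes_to_floor = math.ceil(math.log(EPSILON_MIN / EPSILON) / math.log(EPSILON_DECAY))

print(round(final_epsilon, 3), episodes_to_floor)  # 0.368 9206
```

With `EPISODES = 1000`, ε ends near 0.37, so over a third of actions are still sampled at random at the end of training, and the `EPSILON_MIN` floor is only reached after about 9,206 episodes. A training run can then be started with `agent = QAgent()` followed by `display(agent.evaluate())`.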
