# CartPole Skating
We will use a special **simulation environment**, which will simulate the physics behind the balancing pole.
One of the most popular simulation environments for training reinforcement learning algorithms is Gym, which was originally developed by OpenAI and is now maintained as Gymnasium by the Farama Foundation.
Using it, we can create different **environments**, from a cartpole simulation to Atari games.

In [8]:
import sys
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import random

### Initialize a cartpole environment
To work with the cartpole balancing problem, we need to initialize the corresponding environment. Each environment is associated with:

- **Observation space** that defines the structure of the information we receive from the environment. For the cartpole problem, we receive the position and velocity of the cart and the angle and rotation rate of the pole.

- **Action space** that defines possible actions. In our case the action space is discrete, and consists of two actions - **left** and **right**.

In [9]:
# Initialize
env = gym.make('CartPole-v1', render_mode='human')
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())

Discrete(2)
Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
0


To see how the environment works, let's run a short simulation for 100 steps. At each step, we provide one of the actions to be taken. In this simulation, we just randomly select an action from `action_space`.

In [None]:
env.reset()

for i in range(100):
    env.render()
    env.step(env.action_space.sample())
env.close()


During the simulation, we need to get observations in order to decide how to act. In fact, the `step` function returns the current observation, a reward, the `terminated` and `truncated` flags that together indicate whether it makes sense to continue the simulation, and an `info` dictionary.

In [None]:
env.reset()

done = False
while not done:
    env.render()
    obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated
    print(f"{obs} -> {rew}")
env.close()
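
The loop above can be wrapped into a small helper that also accumulates the reward. The sketch below is one possible way to do it: the names `run_episode` and `choose_action` are hypothetical (they are not part of Gymnasium), and the function assumes only the `reset`/`step` interface used above.

```python
def run_episode(env, choose_action):
    """Run one episode and return (total_reward, number_of_steps).

    `env` is assumed to follow the Gymnasium interface:
    reset() -> (obs, info), step(a) -> (obs, reward, terminated, truncated, info).
    `choose_action` is any function mapping an observation to an action.
    """
    obs, info = env.reset()
    total_reward, steps = 0.0, 0
    done = False
    while not done:
        action = choose_action(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
        # the episode ends when either flag is set
        done = terminated or truncated
    return total_reward, steps
```

With the environment created earlier, `run_episode(env, lambda obs: env.action_space.sample())` would play one episode with random actions.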

The observation vector that is returned at each step of the simulation contains the following values:

- Position of cart
- Velocity of cart
- Angle of pole
- Rotation rate of pole

In [None]:
# get min and max value of those numbers
print(env.observation_space.low)
print(env.observation_space.high)
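
To make these bounds easier to read, the angle limit can be converted from radians to degrees. This is a small sketch using the values printed earlier (4.8 and 0.41887903). The statement that an episode actually terminates at half of these limits is taken from the Gymnasium CartPole documentation, not from the output above, so treat it as an assumption here.

```python
import math

# Bounds copied from the printed observation space
position_bound = 4.8
angle_bound_rad = 0.41887903

# Convert the pole angle bound to degrees (about 24)
angle_bound_deg = math.degrees(angle_bound_rad)
print(f"angle bound: {angle_bound_deg:.1f} degrees")

# Assumption (per Gymnasium docs): the episode terminates at half of each bound
position_limit = position_bound / 2
angle_limit_deg = angle_bound_deg / 2
print(f"termination: |x| > {position_limit:.1f}, |angle| > {angle_limit_deg:.1f} degrees")
```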

You may also notice that the reward value on each simulation step is always 1. This is because our goal is to survive as long as possible, i.e. keep the pole in a reasonably vertical position for the longest period of time.

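✅ In fact, the CartPole simulation is considered solved if we manage to get an average total reward of 195 over 100 consecutive trials. This threshold comes from `CartPole-v0`, whose episodes are capped at 200 steps; `CartPole-v1`, used here, caps episodes at 500 steps and registers a higher threshold of 475.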
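
This criterion can be checked with a rolling average over episode totals. The sketch below is only an illustration: `is_solved` is a hypothetical helper, `episode_rewards` is assumed to be a list holding the total reward of each finished episode, and the defaults simply restate the 195 and 100 mentioned above.

```python
import numpy as np

def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if the mean of the last `window` episode totals reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough episodes yet to judge
    return bool(np.mean(episode_rewards[-window:]) >= threshold)

# Made-up totals for illustration only (not real training results)
print(is_solved([200.0] * 100))
print(is_solved([200.0] * 50))
```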

### State discretization

In Q-Learning, we need to build a Q-Table that defines what to do at each state. To be able to do this, the state needs to be **discrete**; more precisely, it should contain a finite number of discrete values. Thus, we need to somehow **discretize** our observations, mapping them to a finite set of states.

There are a few ways we can do this:

- **Divide into bins**. If we know the interval of a certain value, we can divide this interval into a number of **bins**, and then replace the value by the number of the bin it belongs to. This can be done using the numpy [`digitize`](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) function. In this case, we will know the state size precisely, because it depends on the number of bins we select for discretization.

- **Linear interpolation**. We can linearly map values to some finite interval (say, from -20 to 20), and then convert the numbers to integers by rounding them. This gives us a bit less control over the size of the state, especially if we do not know the exact ranges of the input values. For example, in our case 2 out of 4 values have no upper/lower bounds, which may result in an infinite number of states.

In our example, we will go with the second approach. As you may notice later, despite the undefined upper/lower bounds, those values rarely fall outside certain finite intervals, so states with extreme values will be very rare.


In [None]:
# function that takes an observation from the environment and produces a tuple of 4 integer values
def discretize(x):
    # scale each component by its step size, then truncate to integers
    return tuple((x / np.array([0.25, 0.25, 0.01, 0.1])).astype(int))
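
To see what this mapping does, here is a worked example on a made-up observation (the numbers are illustrative, not taken from a real run). The function is repeated so the snippet runs on its own, using the built-in `int` as the target type. Note that `astype(int)` truncates toward zero, so values slightly below and slightly above zero both land in state 0.

```python
import numpy as np

def discretize(x):
    # divide each component by its step size, then truncate toward zero
    return tuple((x / np.array([0.25, 0.25, 0.01, 0.1])).astype(int))

obs = np.array([0.03, -0.2, 0.017, -0.38])  # made-up observation

# scaled values are roughly [0.12, -0.8, 1.7, -3.8] before truncation
state = tuple(int(v) for v in discretize(obs))  # plain ints for readable printing
print(state)  # (0, 0, 1, -3)
```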

Let's also explore another discretization method using bins:

In [None]:
def create_bins(i, num):
    # num + 1 evenly spaced bin edges covering the interval i = (low, high)
    return np.arange(num + 1) * (i[1] - i[0]) / num + i[0]

print("Sample bins for interval (-5, 5) with 10 bins\n", create_bins((-5, 5), 10))

ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]

def discretize_bins(x):
    return tuple(np.digitize(x[i], bins[i]) for i in range(4))
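
It helps to check what `np.digitize` returns at and beyond the edges. The sketch below reuses the bin construction from above (with `np.arange`) on a few hand-picked values. An index of 0 means the value lies below the first edge, and `num + 1` means it lies at or above the last edge, so each parameter can take `num + 2` distinct indices.

```python
import numpy as np

def create_bins(i, num):
    # num + 1 evenly spaced edges over the interval i = (low, high)
    return np.arange(num + 1) * (i[1] - i[0]) / num + i[0]

edges = create_bins((-5, 5), 10)  # 11 edges: -5, -4, ..., 5

# below range, first bin, just above zero, above range
indices = [int(np.digitize(v, edges)) for v in (-7.0, -4.5, 0.3, 7.0)]
print(indices)  # [0, 1, 6, 11]

# upper bound on the number of distinct states for the bin counts used above
nbins = [20, 20, 10, 10]
n_states = int(np.prod([n + 2 for n in nbins]))
print(n_states)  # 69696
```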

Let's run a short simulation and observe those discrete environment values. Feel free to try both `discretize` and `discretize_bins` and see if there is a difference.

✅ `discretize_bins` returns the bin number, which is non-negative: 0 means the value lies below the first bin edge. Thus, for values of the input variable around 0 it returns a number from the middle of the range (around 10 when 20 bins are used). In `discretize`, we did not care about the range of output values, allowing them to be negative, so the state values are not shifted, and 0 corresponds to 0.

In [None]:
env.reset()

done = False
while not done:
    #env.render()
    obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
    done = terminated or truncated
    #print(discretize_bins(obs))
    print(discretize(obs))
env.close()
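
Because both functions return tuples of integers, the discrete state can be used directly as a dictionary key, which is one simple way to hold a Q-Table. The sketch below is an assumption about how such a table could be stored, not a method taken from this section: the names `Q`, `qvalues` and `actions` are hypothetical, and unseen state-action pairs default to 0.

```python
actions = (0, 1)  # left and right, matching Discrete(2)

Q = {}  # hypothetical Q-Table: maps (state, action) -> estimated value

def qvalues(state):
    """Return the Q-values of all actions in `state`, defaulting to 0 for unseen pairs."""
    return [Q.get((state, a), 0.0) for a in actions]

state = (0, 0, 1, -3)   # a made-up discretized state
Q[(state, 1)] = 0.5     # pretend a value was learned for action 1
print(qvalues(state))   # [0.0, 0.5]

# pick the action with the highest stored value in this state
best_action = max(actions, key=lambda a: Q.get((state, a), 0.0))
print(best_action)      # 1
```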