# Setup

Import the necessary libraries and create the CartPole environment.

In [32]:
import gym
import numpy as np

# Initialize the gym environment
env = gym.make('CartPole-v1', render_mode="human")
# Display the number of actions the agent can take
print(env.action_space.n)

2




# Implementing Q-Learning

For this task, I will implement the Q-Learning algorithm, since it is one of the simplest RL algorithms and CartPole is an appropriately simple task. Q-Learning is model-free: the agent does not build a model of the environment's dynamics, but instead learns action values directly from the rewards it experiences. "Q-Learning uses previously learned 'states' which have been explored to consider future moves and stores this information in a “Q-Table.” For every action taken from a state, the policy table, Q table, has to include a positive or negative reward" (Fakhry, 2020, para. 3).

First, some hyperparameters can be defined (these may be tweaked later as necessary). It is also necessary to define a function for getting discrete states for use with the RL algorithm:

In [33]:
# Hyperparameters
ALPHA = 0.1       # learning rate
GAMMA = 0.99      # discount factor
EPSILON = 1.0     # exploration rate
EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01
EPISODES = 500   # number of episodes to train

# Discretize the continuous state space
def discretize_state(state, bins):
    discretized_state = []
    for i in range(len(state)):
        # Set upper and lower bounds to prevent overflow
        high = np.minimum(env.observation_space.high[i], 1e5)
        low = np.maximum(env.observation_space.low[i], -1e5)

        # Scale the state variable to [0, bins-1]
        scaling = (state[i] - low) / (high - low)
        new_state = int(np.round(scaling * (bins - 1)))
        new_state = np.clip(new_state, 0, bins - 1)
        discretized_state.append(new_state)
    return tuple(discretized_state)
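
One caveat with the bounds used above: CartPole reports its two velocity components as effectively unbounded, so clipping them to ±1e5 places every realistic velocity near the middle of the range, and in practice those two dimensions capture little more than the sign of the velocity. A minimal sketch of an alternative, assuming hand-picked finite limits (the values in `ASSUMED_BOUNDS` are illustrative guesses, not taken from Gym or from this notebook's results, and `discretize_with_bounds` is a name introduced here):

```python
# Illustrative finite limits per observation component:
# (cart position, cart velocity, pole angle in radians, pole angular velocity).
# These numbers are assumptions chosen for the sketch, not values from Gym.
ASSUMED_BOUNDS = [(-4.8, 4.8), (-3.0, 3.0), (-0.418, 0.418), (-3.5, 3.5)]


def discretize_with_bounds(state, bins, bounds=ASSUMED_BOUNDS):
    """Map a continuous observation to a tuple of bin indices in [0, bins - 1]."""
    indices = []
    for value, (low, high) in zip(state, bounds):
        # Clamp first so out-of-range readings fall into the edge bins
        clamped = min(max(value, low), high)
        # Rescale to [0, 1], then take the equal-width bin the value falls in
        fraction = (clamped - low) / (high - low)
        indices.append(min(int(fraction * bins), bins - 1))
    return tuple(indices)


# Two different cart velocities now land in different bins
slow = discretize_with_bounds([0.0, 0.5, 0.0, 0.0], bins=20)
fast = discretize_with_bounds([0.0, 2.0, 0.0, 0.0], bins=20)
print(slow, fast)  # (10, 11, 10, 10) (10, 16, 10, 10)
```

Whether these particular limits suit the task would need to be checked against the velocities actually observed during training.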

Next, the number of bins per state variable is set explicitly. This determines the shape of the Q-table, in which every entry estimates the total discounted reward expected after taking a given action in a given state. The table is initialized with random values between -2 and 0.

In [34]:
# Initialize Q-table
n_bins = 20
n_states = (n_bins, n_bins, n_bins, n_bins)
q_table = np.random.uniform(low=-2, high=0, size=(n_states + (env.action_space.n,)))

episode_rewards = [] # Init episode rewards
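
To get a feel for the size of this table, here is a quick back-of-the-envelope check. It is a sketch only; `table_entries` is a helper name introduced here, not part of the notebook.

```python
import math


def table_entries(n_states, n_actions):
    """Number of Q-values stored for a grid of discretized states."""
    return math.prod(n_states) * n_actions


# 20 bins for each of the 4 observation components, 2 actions
entries = table_entries((20, 20, 20, 20), 2)
print(entries)  # 320000
```

Each entry is only updated when its state-action pair is actually visited, so a finer grid generally needs more training steps before the estimates become useful.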

# Training

The training loop updates the Q-values based on the reward received and the estimated future reward. At each step the agent acts epsilon-greedily: with probability epsilon it takes a random action, and otherwise it takes the action with the highest Q-value. As epsilon decays across episodes, the agent shifts from exploration toward 'exploitation', choosing the best known action rather than selecting randomly.
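
The update rule used in the loop can be isolated as a small function, which makes it easy to check by hand. The sketch below is a standalone illustration rather than the notebook's code: `q_update` and its `done` argument are names introduced here, and the `done` handling (dropping the future term once the episode has ended) is a common refinement that the training cell in this notebook does not include.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.99, done=False):
    """Return the updated Q-value for one transition."""
    # No future reward is available after a terminal step
    future = 0.0 if done else gamma * next_max
    target = reward + future
    # Blend the old estimate with the new target
    return (1 - alpha) * old_value + alpha * target


# Worked example: old estimate -1.0, reward 1.0, best next value -0.5
mid_episode = q_update(-1.0, 1.0, -0.5)             # about -0.8495
last_step = q_update(-1.0, 1.0, -0.5, done=True)    # about -0.8
print(mid_episode, last_step)
```

With `alpha=0.1`, each update moves the stored value only a tenth of the way toward the new target, which is why many visits are needed before an entry settles.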

In [35]:
# Q-learning algorithm
for episode in range(EPISODES):
    current_state = discretize_state(env.reset(), n_bins)
    total_reward = 0 # Accumulate total reward per episode
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < EPSILON:
            action = env.action_space.sample()  # Explore action space
        else:
            action = np.argmax(q_table[current_state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)
        next_state = discretize_state(next_state, n_bins)
        total_reward += reward

        # Update Q-table
        old_value = q_table[current_state + (action,)]
        next_max = np.max(q_table[next_state])

        # Q-learning formula
        new_value = (1 - ALPHA) * old_value + ALPHA * (reward + GAMMA * next_max)
        q_table[current_state + (action,)] = new_value

        current_state = next_state

    # Append total reward of the episode
    episode_rewards.append(total_reward)

    # Decay epsilon
    EPSILON = max(MIN_EPSILON, EPSILON * EPSILON_DECAY)

    if episode % 100 == 0:
        average_reward = sum(episode_rewards[-100:]) / 100
        print(f"Episode: {episode}, Average Reward: {average_reward}, Epsilon: {EPSILON}")

print("Training finished.")



Episode: 0, Average Reward: 0.34, Epsilon: 0.995
Episode: 100, Average Reward: 16.84, Epsilon: 0.6027415843082742
Episode: 200, Average Reward: 12.82, Epsilon: 0.36512303261753626
Episode: 300, Average Reward: 12.49, Epsilon: 0.2211807388415433
Episode: 400, Average Reward: 19.04, Epsilon: 0.13398475271138335
Training finished.


Epsilon decreases as expected, reflecting the shift from random exploration to informed exploitation, which also helps the Q-table stabilize and converge toward a better decision-making policy. Logging the average reward tracks the agent's performance over time. Note that the episode-0 figure of 0.34 is an artifact of dividing by 100 when only one episode (with a reward of 34) had been recorded. In my case, the average reward dipped between episodes 100 and 300 (from 16.84 to 12.49) but rose to 19.04 by episode 400, the last logged value.
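
How far the exploration rate falls depends only on the decay factor and the number of episodes, so it can be checked without running the environment. The sketch below reuses the hyperparameter values from above; the helper names `epsilon_after` and `episodes_to_floor` are introduced here for illustration.

```python
import math

EPSILON_START = 1.0
EPSILON_DECAY = 0.995
MIN_EPSILON = 0.01


def epsilon_after(n_episodes):
    """Exploration rate after n_episodes multiplicative decays, floored at MIN_EPSILON."""
    return max(MIN_EPSILON, EPSILON_START * EPSILON_DECAY ** n_episodes)


def episodes_to_floor():
    """Smallest number of decays after which epsilon has reached MIN_EPSILON."""
    return math.ceil(math.log(MIN_EPSILON / EPSILON_START) / math.log(EPSILON_DECAY))


print(round(epsilon_after(500), 4))  # 0.0816
print(episodes_to_floor())           # 919
```

So with 500 episodes the agent still picks a random action roughly 8% of the time at the end of training, and reaching the 0.01 floor would take about 919 episodes at this decay rate.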

# Conclusion

The agent's performance can best be visualized by rendering the environment. The following code visualizes an episode run.

In [36]:
# To visualize one episode run
state = env.reset()
for _ in range(100):
    env.render()
    action = np.argmax(q_table[discretize_state(state, n_bins)])
    state, reward, done, _ = env.step(action)
    if done:
        break

env.close()
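
The loops in this notebook assume the older Gym interface, in which `env.reset()` returns only the observation and `env.step()` returns four values. Gym 0.26 and later (and Gymnasium) instead return `(observation, info)` from `reset()` and a five-element tuple `(observation, reward, terminated, truncated, info)` from `step()`. A minimal sketch of two wrappers that accept either shape; `reset_env` and `step_env` are hypothetical helper names, not part of Gym:

```python
def reset_env(env):
    """Return only the observation, whichever reset() convention env uses."""
    result = env.reset()
    # Newer API: (observation, info_dict)
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: the observation itself
    return result


def step_env(env, action):
    """Return (observation, reward, done) for both 4- and 5-value step() results."""
    result = env.step(action)
    if len(result) == 5:
        obs, reward, terminated, truncated, _info = result
        # Treat either ending condition as the end of the episode
        return obs, reward, bool(terminated or truncated)
    obs, reward, done, _info = result
    return obs, reward, bool(done)
```

With these in place, the visualization loop would call `state = reset_env(env)` and `state, reward, done = step_env(env, action)`.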

I used just 500 episodes in order to reduce the training time, although I realize more episodes would lead to better performance. NOTE: Visually rendering the environment does not translate well to Google Colab. Ordinarily, a separate window would appear and display the environment, but no such window appears in Colab, so the rendering code does not behave as expected. Overall, I enjoyed the change of scenery that this form of machine learning provided. It is definitely much different from the other techniques we have covered.

# References

Fakhry, A. (2020, November 13). *Using Q-learning for OpenAI’s Cartpole-V1*. Medium. https://medium.com/swlh/using-q-learning-for-openais-cartpole-v1-4a216ef237df


Géron, A. (2023a). *Chapter 18 – Reinforcement Learning Code*. Google Colab. https://colab.research.google.com/github/ageron/handson-ml3/blob/main/18_reinforcement_learning.ipynb#scrollTo=rF0zo2-xyK8_


Géron, A. (2023b). *Hands-On Machine Learning With Scikit-Learn, Keras, and Tensorflow: Concepts, Tools, and Techniques to Build Intelligent Systems*. O’Reilly Media.