# Cart Pole

I adapt ["Actor Critic Method" by Apoorv Nandan](https://keras.io/examples/rl/actor_critic_cartpole/) to the [Gymnasium (formerly OpenAI Gym) CartPole-v1 task](https://gymnasium.farama.org/environments/classic_control/cart_pole/).

The Actor Critic method involves two components:

* The *actor* takes the current state and computes a probability for each action in the action space.
* The *critic* estimates the sum of all rewards the agent expects to receive in the future, starting from that state.

The agent learns to select the actions that maximize the reward it expects to receive.
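
As a rough sketch of how the two outputs combine at a single time step, the snippet below computes one actor term and one critic term in plain Python. The numbers are made up, the helper names `huber` and `step_losses` are hypothetical, and the Huber threshold of 1.0 is assumed to match the Keras default.

```python
import math

def huber(error, delta=1.0):
    """Huber loss for one scalar error: quadratic near zero, linear beyond delta."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

def step_losses(action_probs, action, value, observed_return):
    """Return (actor_loss, critic_loss) for one time step.

    action_probs:    actor output, one probability per action
    action:          index of the action that was actually taken
    value:           critic's estimate of the future reward
    observed_return: reward actually collected afterwards
    """
    # How much better the outcome was than the critic predicted.
    advantage = observed_return - value
    # Minimizing this raises the probability of actions with positive advantage.
    actor_loss = -math.log(action_probs[action]) * advantage
    # The critic is penalized for its prediction error.
    critic_loss = huber(observed_return - value)
    return actor_loss, critic_loss

# Illustrative numbers only.
actor_loss, critic_loss = step_losses(
    action_probs=[0.3, 0.7], action=1, value=0.2, observed_return=0.5
)
print(f"actor loss: {actor_loss:.4f}, critic loss: {critic_loss:.4f}")
```

When the observed return exceeds the critic's estimate, lowering the actor term makes the chosen action more likely; when it falls short, the same update makes that action less likely.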

In the CartPole-v1 task, the agent can take two actions: push cart to the left (0) and push cart to the right (1). An observation $(x, v, \theta, \omega)$ consists of position $x$, velocity $v$, pole angle $\theta$, and angular velocity $\omega$. The agent is awarded +1 for each step taken, since the goal is to keep the pole upright as long as possible.
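
To make the observation and reward concrete, here is a small sketch that unpacks $(x, v, \theta, \omega)$ and counts the reward for a hand-written trajectory. The termination limits (cart position beyond ±2.4, pole angle beyond ±12°) are taken from the Gymnasium CartPole documentation rather than from this notebook, and `is_terminal` and `episode_reward` are hypothetical helpers, not part of Gymnasium.

```python
import math

# Limits as documented for Gymnasium's CartPole (an assumption here,
# not something this notebook defines).
X_LIMIT = 2.4
THETA_LIMIT = math.radians(12)

def is_terminal(observation):
    """Hypothetical helper: True if the observation violates a limit."""
    x, v, theta, omega = observation
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT

def episode_reward(observations):
    """Count +1 per step taken, stopping after the first terminal observation."""
    total = 0
    for observation in observations:
        total += 1  # every step taken earns +1, including the last one
        if is_terminal(observation):
            break
    return total

# Made-up trajectory: the pole falls past 12 degrees on the third step.
trajectory = [
    (0.00, 0.2, 0.02, -0.3),
    (0.01, 0.4, 0.10, 0.5),
    (0.02, 0.6, 0.25, 0.9),   # 0.25 rad is beyond the 12 degree limit
    (0.03, 0.8, 0.40, 1.2),   # never reached
]
print(episode_reward(trajectory))
```

The longer the agent keeps the observation inside these limits, the more reward it collects, which is why the episode length doubles as the score.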

First, the required libraries:

In [2]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import gymnasium as gym
import numpy as np
import tensorflow.keras as keras
from tensorflow.keras import ops, Model
from tensorflow.keras.layers import Dense, Input
import tensorflow as tf

In [3]:
DISCOUNT_FACTOR = 0.99
STEPS_PER_EPISODE = 10_000
env = gym.make('CartPole-v1')
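
`DISCOUNT_FACTOR` controls how strongly future rewards count toward the return at each step. The sketch below works this out for a three-step episode with a reward of +1 per step; the function name `discounted_returns` is hypothetical, and the recursion is the usual backward pass $G_t = r_t + \gamma G_{t+1}$.

```python
import numpy as np

GAMMA = 0.99  # same value as DISCOUNT_FACTOR

def discounted_returns(rewards, gamma=GAMMA):
    """Return G_t = r_t + gamma * G_{t+1} for every step t, computed backward."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = discounted_returns([1.0, 1.0, 1.0])
print(returns)  # approximately 2.9701, 1.99, 1.0
```

The first step has the largest return because it is followed by the most rewards, and each later reward is scaled by one more factor of 0.99.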

The actor and the critic share an input and hidden layer:
![Diagram of Model](./Cart%20Pole%20Actor-Critic%20Model.svg)

In [4]:
NUM_INPUTS = 4
NUM_ACTIONS = 2
NUM_HIDDEN = 128
EPS = np.finfo(np.float32).eps

inputs = Input(shape=(NUM_INPUTS,))
common = Dense(NUM_HIDDEN, activation='relu')(inputs)
action = Dense(NUM_ACTIONS, activation='softmax')(common)
critic = Dense(1)(common)

model = Model(inputs=inputs, outputs=[action, critic])
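
To make the shapes concrete, here is a NumPy-only sketch of the same topology: one shared ReLU layer feeding a softmax head and a linear head. The weights are random stand-ins rather than anything Keras learned, and `forward` is a hypothetical helper, so only the shapes and the softmax normalization are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_INPUTS, NUM_HIDDEN, NUM_ACTIONS = 4, 128, 2

# Randomly initialized stand-ins for the weights Keras would learn.
w_common = rng.normal(scale=0.1, size=(NUM_INPUTS, NUM_HIDDEN))
b_common = np.zeros(NUM_HIDDEN)
w_actor = rng.normal(scale=0.1, size=(NUM_HIDDEN, NUM_ACTIONS))
b_actor = np.zeros(NUM_ACTIONS)
w_critic = rng.normal(scale=0.1, size=(NUM_HIDDEN, 1))
b_critic = np.zeros(1)

def forward(states):
    """Map a batch of observations to (action probabilities, value estimates)."""
    # Shared hidden layer with ReLU activation.
    hidden = np.maximum(states @ w_common + b_common, 0.0)
    # Actor head: softmax over the actions (shifted for numerical stability).
    logits = hidden @ w_actor + b_actor
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    action_probs = exp / exp.sum(axis=1, keepdims=True)
    # Critic head: a single linear output per state.
    values = hidden @ w_critic + b_critic
    return action_probs, values

batch = np.array([[0.0, 0.1, 0.02, -0.1]])  # one made-up observation
action_probs, values = forward(batch)
print(action_probs.shape, values.shape)  # (1, 2) (1, 1)
```

Because both heads read the same hidden activations, a gradient step on either loss also changes the features the other head sees.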

In [5]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
critic_loss = keras.losses.Huber()
running_reward = 0
episode_count = 0

while True:
    state, _ = env.reset()
    action_probs_history = []
    expected_return_history = []
    rewards_history = []
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, STEPS_PER_EPISODE):
            # env.render()

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            action_probs, expected_return = model(state)
            action = np.random.choice(NUM_ACTIONS, p=np.squeeze(action_probs))
            action_probs_history.append(ops.log(action_probs[0, action]))
            expected_return_history.append(expected_return[0, 0])

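            # Gymnasium's step() returns (observation, reward, terminated,
            # truncated, info). Only `terminated` ends the episode here; the
            # `truncated` flag (CartPole-v1's 500-step time limit) is
            # discarded, so episode length is bounded by STEPS_PER_EPISODE.
            state, reward, terminated, _, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if terminated:
                break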

        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # The return at step t is the discounted sum of the rewards from
        # step t onward. With rewards r_0, ..., r_{T-1} and discount
        # factor Y, it is computed backward from the final step:
        #     returns[T-1] = r_{T-1}
        #     returns[T-2] = r_{T-2} + Y * returns[T-1]
        #     returns[t]   = r_t + Y * returns[t+1]
        returns = np.zeros(len(rewards_history))
        discounted_return = 0
        for i in range(len(returns)):
            discounted_return = rewards_history[-1 - i] + DISCOUNT_FACTOR * discounted_return
            returns[-1 - i] = discounted_return

        # Normalize by computing the Z-score (x - mean) / stdev.
        returns = (returns - np.mean(returns)) / (np.std(returns) + EPS)


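        actor_loss = 0.
        critic_loss_ = 0.
        for action_prob, value, return_ in zip(action_probs_history, expected_return_history, returns):
            # `action_prob` is the log-probability of the action taken.
            # `diff` measures how much better the observed return was than
            # the critic's estimate; the actor is pushed toward actions
            # with a positive difference.
            diff = return_ - value
            actor_loss -= action_prob * diff
            # The critic is trained to predict the observed return.
            critic_loss_ += critic_loss(
                ops.expand_dims(value, 0),
                ops.expand_dims(return_, 0)
            )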

        cost = actor_loss + critic_loss_
        grads = tape.gradient(cost, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

    episode_count += 1
    if episode_count % 10 == 0:
        print(f'Running reward at episode {episode_count}: {running_reward:.2f}')

    if running_reward > 500:
        print(f'Solved at episode {episode_count}!')
        break

Running reward at episode 10: 8.92
Running reward at episode 20: 15.75
Running reward at episode 30: 24.61
Running reward at episode 40: 36.09
Running reward at episode 50: 41.26
Running reward at episode 60: 61.43
Running reward at episode 70: 91.03
Running reward at episode 80: 93.72
Running reward at episode 90: 146.43
Running reward at episode 100: 145.74
Running reward at episode 110: 207.56
Running reward at episode 120: 245.33
Running reward at episode 130: 215.40
Running reward at episode 140: 167.05
Running reward at episode 150: 138.44
Running reward at episode 160: 154.78
Solved at episode 162!
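
The running reward is an exponential moving average, so it lags behind the per-episode reward. The sketch below replays the same update rule on a made-up sequence in which a hypothetical agent scores exactly 500 in every episode; the function name `running_rewards` is an assumption for illustration.

```python
def running_rewards(episode_rewards, alpha=0.05, start=0.0):
    """Exponential moving average, using the same update as the training loop."""
    history = []
    running = start
    for episode_reward in episode_rewards:
        running = alpha * episode_reward + (1 - alpha) * running
        history.append(running)
    return history

# Hypothetical agent that scores exactly 500 in every episode.
history = running_rewards([500.0] * 100)
print(f"after 10 episodes: {history[9]:.1f}")
print(f"after 100 episodes: {history[99]:.1f}")
```

Starting from zero, an average of constant 500-reward episodes approaches 500 but never exceeds it. The `running_reward > 500` check can therefore only pass when some episodes collect more than 500 reward, which is possible here because an episode can run for up to `STEPS_PER_EPISODE` steps.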


In [6]:
model.save('cart_pole.keras')