## Nico KnÃ¼nz

# Policy Gradient Exercises

**NOTICE:**
1. You are allowed to work in groups of up to three people but **have to document** your group's\
 members in the top cell of your notebook.
2. **Comment your code**, explain what you do (refer to the slides). It will help you understand the topics\
 and help me understand your thinking progress. Quality of comments will be graded.
3. **Discuss** and analyze your results, **write-down your learnings**. These exercises are no programming\
 exercises it is about learning and getting a touch for these methods. Such questions might be asked in the\
 final exams.
 4. Feel free to **experiment** with these methods. Change parameters think about improvements, write down\
 what you learned. This is not only about collecting points for the final grade, it is about understanding\
  the methods.

In [1]:
import os
os.environ["MUJOCO_PATH"] = "/home/nico/.mujoco/mujoco-2.3.0"
os.environ["MUJOCO_PLUGIN_PATH"] = "/home/nico/.mujoco/mujoco-2.3.0/plugin"

# If you run on google-colab you have to install this package whenever you start a kernel
#
!pip install gymnasium
!pip install mujoco==2.3.0



### Exercise 1 - REINFORCE

**Summary:** Implement the REINFORCE algorithm and use it to solve the ```CartPole-v1``` environment.


**Provided Code:** Feel free to re-use code from previous exercises.


**Your Tasks in this exercise:**
1. Implement REINFORCE
2. Solve the ```CartPole-v1``` environment.
    


In [2]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

env = gym.make("CartPole-v1")

gamma = 0.99
alpha = 1e-3

# create a policy network
def create_policy_net(state_dim, action_dim):
    model = Sequential([
        Input(shape=(state_dim,)),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(action_dim, activation='softmax')  # output probabilities for action
    ])
    return model

# instantiate the policy network
policy_net = create_policy_net(4, 2)
optimizer = tf.keras.optimizers.Adam(alpha)

# function to generate an episode using the current policy
def generate_episode(policy_net):
    # create lists to store states, actions, rewards
    states, actions, rewards = [], [], []

    # reset the environment
    s, _ = env.reset()
    # loop until the episode is done
    while True:
        # get action probabilities from the policy network
        s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
        # sample an action according to the porbabilities
        probs = policy_net(s_tensor)[0].numpy()
        a = np.random.choice(len(probs), p=probs)

        # take the action in the environment
        s_next, r, terminated, truncated, _ = env.step(a)

        # store the transition
        states.append(s)
        actions.append(a)
        rewards.append(r)

        # check if episode is done
        if terminated or truncated:
            break

        # move to the next state
        s = s_next

    # return the lists
    return states, actions, rewards

# function to compute returns
def compute_returns(rewards, gamma):
    # start with an array of zeros
    G = np.zeros(len(rewards))

    # compute the returns backwards and store them in G
    running_sum = 0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        G[t] = running_sum
    return G

# functiono to update the policy network
def reinforce_update(policy_net, states, actions, returns):
    # convert to tensors
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    returns = tf.convert_to_tensor(returns, dtype=tf.float32)

    with tf.GradientTape() as tape:
        # compute the loss
        probs = policy_net(states)
        action_masks = tf.one_hot(actions, depth=2)
        selected_probs = tf.reduce_sum(probs * action_masks, axis=1)

        # compute log probabilities
        log_probs = tf.math.log(selected_probs + 1e-8)

        # compute the loss as negative expected return
        loss = -tf.reduce_mean(log_probs * returns)  

    # compute gradients and apply them
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    
solved = False

# training loop
for episode in range(2000):
    # generate an episode
    states, actions, rewards = generate_episode(policy_net)
    # compute returns for the episode
    returns = compute_returns(rewards, gamma)

    # update the policy network
    reinforce_update(policy_net, states, actions, returns)

    # evaluate the policy every 25 episodes
    if episode % 25 == 0:
        total = 0
        # run 10 evalution episode to get average reward
        for _ in range(10):
            s,_ = env.reset()
            ep_reward = 0
            while True:
                # select the action with highest probability
                s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
                probs = policy_net(s_tensor)[0].numpy()
                a = np.argmax(probs)

                # take the action in the environment
                s, r, terminated, truncated,_ = env.step(a)
                # accumulate reward
                ep_reward += r
                if terminated or truncated:
                    break
            # accumulate total reward over 10 episodes
            total += ep_reward

        # compute average reward
        avg = total / 10
        print(f"Episode {episode}, Avg Reward: {avg}")

        if avg >= 500:
            print("==== Solved! ====")
            break

2026-02-03 00:05:45.860303: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0000 00:00:1770073547.304691   19780 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3721 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:07:00.0, compute capability: 7.5


Episode 0, Avg Reward: 108.5
Episode 25, Avg Reward: 123.0
Episode 50, Avg Reward: 57.6
Episode 75, Avg Reward: 57.1
Episode 100, Avg Reward: 51.1
Episode 125, Avg Reward: 41.6
Episode 150, Avg Reward: 43.2
Episode 175, Avg Reward: 49.7
Episode 200, Avg Reward: 72.7
Episode 225, Avg Reward: 60.1
Episode 250, Avg Reward: 61.1
Episode 275, Avg Reward: 53.4
Episode 300, Avg Reward: 55.7
Episode 325, Avg Reward: 58.7
Episode 350, Avg Reward: 50.9
Episode 375, Avg Reward: 58.1
Episode 400, Avg Reward: 58.3
Episode 425, Avg Reward: 95.4
Episode 450, Avg Reward: 132.5
Episode 475, Avg Reward: 328.3
Episode 500, Avg Reward: 270.2
Episode 525, Avg Reward: 179.7
Episode 550, Avg Reward: 407.8
Episode 575, Avg Reward: 500.0
==== Solved! ====


### Exercise 2 - Deep Deterministic Policy Gradient (DDPG)

**Summary:** Implement the DDPG algorithm and use it to solve the ```Pusher-v4``` environment. If the   
physics do not work as supposed , you might have to explicitly install mujoco version 2.3.0.


**Provided Code:** Feel free to re-use code from previous exercises. Below I have provided you with   
an implementation for soft weight-updates using keras.


**Your Tasks in this exercise:**
1. Implement DDPG
2. Solve the ```Pusher-v4``` environment.
    

In [1]:
def update_target_weights(source, target, tau=0.99):
    ''' Performs a soft update as:
        target <- tau * tar + (1-tau) * src
        This is the other way as in our previous implementation following the DDPG paper.
    '''
    for i in range(len(source.layers)):

        layer_weights_list_source = source.layers[i].get_weights()
        layer_weights_list_target = target.layers[i].get_weights()

        new_weights = []
        for (w_src, w_target) in zip(layer_weights_list_source, layer_weights_list_target):
            w_target = w_target* tau + (1.0-tau) * w_src
            new_weights.append(w_target)

        target.layers[i].set_weights(new_weights)

In [7]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np

# create actor - input is state_dim, output is action_dim
def build_actor(state_dim, action_dim, action_high):
    inputs = layers.Input(shape=(state_dim,))
    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(action_dim, activation="tanh")(x)
    outputs = outputs * action_high  # we can scale this to the action range
    return tf.keras.Model(inputs, outputs)

# create ciritc - input is state_dim + action_dim, output is Q-value
def build_critic(state_dim, action_dim):
    s_in = layers.Input(shape=(state_dim,))
    a_in = layers.Input(shape=(action_dim,))
    
    x = layers.Concatenate()([s_in, a_in]) # concat state and action
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(256, activation="relu")(x)
    q = layers.Dense(1)(x) # output Q-Value

    return tf.keras.Model([s_in, a_in], q)

# replay buffer to store
class ReplayBuffer:
    def __init__(self, max_size, state_dim, action_dim):
        self.max_size = max_size
        self.ptr = 0
        self.size = 0
        
        # create the storage arrays
        self.s = np.zeros((max_size, state_dim))
        self.a = np.zeros((max_size, action_dim))
        self.r = np.zeros((max_size, 1))
        self.s2 = np.zeros((max_size, state_dim))
        self.d = np.zeros((max_size, 1))

    def store(self, s, a, r, s2, d):
        # store the transition at the current pointer
        self.s[self.ptr] = s
        self.a[self.ptr] = a
        self.r[self.ptr] = r
        self.s2[self.ptr] = s2
        self.d[self.ptr] = d
        
        # move the pointer (ring buffer)
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        # random indices and return them
        idx = np.random.choice(self.size, batch_size)
        return (
            self.s[idx],
            self.a[idx],
            self.r[idx],
            self.s2[idx],
            self.d[idx],
        )

@tf.function
def train_step(actor, critic,
               actor_target, critic_target,
               actor_optimizer, critic_optimizer,
               batch, gamma):

    # unpack the batch
    s, a, r, s2, d = batch

    
    with tf.GradientTape() as tape:
        a2 = actor_target(s2) # next action from target actor
        q_target = critic_target([s2, a2]) # target Q-value
        y = r + gamma * (1.0 - d) * q_target # compute target
        q = critic([s, a]) # current Q-value
        critic_loss = tf.reduce_mean((q - y)**2) # compute critic loss

    critic_grads = tape.gradient(critic_loss, critic.trainable_variables) # compute gradients
    critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables)) # apply gradients

    with tf.GradientTape() as tape:
        actions = actor(s) # actions from current actor
        actor_loss = -tf.reduce_mean(critic([s, actions])) # compute actor loss

    actor_grads = tape.gradient(actor_loss, actor.trainable_variables) # compute actor gradients
    actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables)) # apply actor gradients

# evaluate the policy over a number of episodes
def evaluate_policy(env, actor, episodes=10):
    total_reward = 0.0

    for _ in range(episodes):
        s, _ = env.reset()
        ep_reward = 0.0
        done = False

        while not done:
            s_tensor = tf.convert_to_tensor(s[None], dtype=tf.float32)
            a = actor(s_tensor)[0].numpy()
            a = np.clip(a, env.action_space.low, env.action_space.high)

            s, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            ep_reward += r

        total_reward += ep_reward

    return total_reward / episodes

In [8]:
import gymnasium as gym

env = gym.make("Pusher-v4")

# get dimensions from env
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_high = env.action_space.high[0]

# create networks
actor = build_actor(state_dim, action_dim, action_high)
critic = build_critic(state_dim, action_dim)

# create target networks
actor_target = build_actor(state_dim, action_dim, action_high)
critic_target = build_critic(state_dim, action_dim)

# init the weights
actor_target.set_weights(actor.get_weights())
critic_target.set_weights(critic.get_weights())

actor_optimizer = tf.keras.optimizers.Adam(1e-4)
critic_optimizer = tf.keras.optimizers.Adam(1e-3)

# create replay buffer
buffer = ReplayBuffer(1_000_000, state_dim, action_dim)

# hyperparameters
gamma = 0.99
batch_size = 1024


for episode in range(100000):
    noise_std = max(0.1, 0.3 * (1 - episode / 50_000))
    s, _ = env.reset()
    done = False

    while not done:
        # action with noise for exploration
        a = actor(s[None]).numpy()[0]
        a += np.random.normal(0, noise_std, size=action_dim)
        a = np.clip(a, env.action_space.low, env.action_space.high) # clip action to valid range

        # take action in env
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated

        # store transition in replay buffer
        buffer.store(s, a, r, s2, float(done))
        s = s2 # move to next state

        # train if we have enough samples for a batch
        if buffer.size >= batch_size:
            batch = buffer.sample(batch_size) # sample a batch
            batch = [tf.convert_to_tensor(x, dtype=tf.float32) for x in batch] # convert to tensors

            # train step
            train_step(
                actor, critic,
                actor_target, critic_target,
                actor_optimizer, critic_optimizer,
                batch, gamma
            )

            # soft update target networks
            update_target_weights(actor, actor_target)
            update_target_weights(critic, critic_target)

    # evaluate the policy every 25 episodes
    if episode % 25 == 0:
        avg_reward = evaluate_policy(env, actor, episodes=10)
        print(f"Episode {episode}, Avg Eval Reward: {avg_reward:.2f}")

        # best theoretical possible is 0. try to get close to that. Chatgpt says bigger than -25 is good.
        if avg_reward > -25:
            print("==== Policy is performing well! ====")


Episode 0, Avg Eval Reward: -83.42
Episode 25, Avg Eval Reward: -69.65
Episode 50, Avg Eval Reward: -84.22
Episode 75, Avg Eval Reward: -205.12
Episode 100, Avg Eval Reward: -168.78
Episode 125, Avg Eval Reward: -245.63
Episode 150, Avg Eval Reward: -185.96
Episode 175, Avg Eval Reward: -242.61
Episode 200, Avg Eval Reward: -197.02
Episode 225, Avg Eval Reward: -205.68
Episode 250, Avg Eval Reward: -150.68
Episode 275, Avg Eval Reward: -216.60
Episode 300, Avg Eval Reward: -204.50
Episode 325, Avg Eval Reward: -161.54
Episode 350, Avg Eval Reward: -197.59
Episode 375, Avg Eval Reward: -199.32
Episode 400, Avg Eval Reward: -186.98
Episode 425, Avg Eval Reward: -172.55
Episode 450, Avg Eval Reward: -218.54
Episode 475, Avg Eval Reward: -176.08
Episode 500, Avg Eval Reward: -172.33
Episode 525, Avg Eval Reward: -192.98
Episode 550, Avg Eval Reward: -186.56
Episode 575, Avg Eval Reward: -164.45
Episode 600, Avg Eval Reward: -171.14
Episode 625, Avg Eval Reward: -179.38
Episode 650, Avg Eva

KeyboardInterrupt: 

In [9]:
import gymnasium as gym
from IPython import display
import matplotlib.pyplot as plt

env = gym.make("Pusher-v4", render_mode="rgb_array")
s_t, _ = env.reset()

done = False
frames = []

while not done:
    # Select action from policy (actor outputs continuous action)
    # ensure observation array dtype matches the model expectation
    a_t = actor_target(np.array([s_t], dtype=np.float32))[0].numpy().squeeze()
    
    s_tplus1, r_tplus1, terminated, truncated, info = env.step(a_t)
    done = terminated or truncated
    
    # Get the rendered frame
    frame = env.render()
    frames.append(frame)
    
    s_t = s_tplus1

env.close()

# Display frames inline as an animation
import matplotlib.animation as animation

fig = plt.figure()
patch = plt.imshow(frames[0])

def animate(i):
    patch.set_data(frames[i])

ani = animation.FuncAnimation(fig, animate, frames=len(frames), interval=30)
plt.close()
display.display(ani)



<matplotlib.animation.FuncAnimation at 0x792cac064b60>

In [10]:
import imageio

imageio.mimsave('pusher.gif', frames, fps=30)