# Policy Gradient Exercises

**NOTICE:**
1. You are allowed to work in groups of up to three people but **have to document** your group's members in the top cell of your notebook.
2. **Comment your code** and explain what you do (refer to the slides). It will help you understand the topics and help me understand your thought process. Quality of comments will be graded.
3. **Discuss** and analyze your results, and **write down your learnings**. These are not programming exercises; they are about learning and getting a feel for these methods. Such questions might be asked in the final exam.
4. Feel free to **experiment** with these methods. Change parameters, think about improvements, and write down what you learned. This is not only about collecting points for the final grade; it is about understanding the methods.

In [None]:
# If you run on Google Colab, you have to install these packages whenever you start a kernel
!pip install gymnasium
!pip install mujoco


### Exercise 1 - REINFORCE

**Summary:** Implement the REINFORCE algorithm and use it to solve the ```CartPole-v1``` environment.


**Provided Code:** Feel free to re-use code from previous exercises.


**Your Tasks in this exercise:**
1. Implement REINFORCE
2. Solve the ```CartPole-v1``` environment.
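
As a small worked example of the quantity REINFORCE weights its log-probabilities with, the sketch below computes the discounted returns $G_t = r_t + \gamma G_{t+1}$ backwards over one episode. The function name `discounted_returns` and the toy reward list are illustrative assumptions, not part of the provided code.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t (hypothetical helper)."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each step can reuse the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy episode (assumed values): three steps with reward 1 each.
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example_returns)
```

For three rewards of 1 and $\gamma = 0.5$ this gives returns of 1.75, 1.5 and 1.0: early actions get credit for everything that follows them, discounted by how far away it is.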
    


In [10]:
import gymnasium as gym
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

env = gym.make("CartPole-v1")

gamma = 0.99  # discount factor
alpha = 1e-3  # learning rate of the policy network

def create_policy_net(state_dim, action_dim):
    model = Sequential([
        Input(shape=(state_dim,)),
        Dense(32, activation='relu'),
        Dense(32, activation='relu'),
        Dense(action_dim, activation='softmax')  # probabilities!
    ])
    return model

policy_net = create_policy_net(4, 2)
optimizer = tf.keras.optimizers.Adam(alpha)

def generate_episode(policy_net):
    states, actions, rewards = [], [], []

    s, _ = env.reset()
    while True:
        s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
        probs = policy_net(s_tensor)[0].numpy()
        a = np.random.choice(len(probs), p=probs)

        s_next, r, terminated, truncated, _ = env.step(a)

        states.append(s)
        actions.append(a)
        rewards.append(r)

        if terminated or truncated:
            break

        s = s_next

    return states, actions, rewards

def compute_returns(rewards, gamma):
    # Discounted return G_t = r_t + gamma * G_{t+1}, computed backwards over the episode
    G = np.zeros(len(rewards))
    running_sum = 0
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        G[t] = running_sum
    return G

def reinforce_update(policy_net, states, actions, returns):
    states = tf.convert_to_tensor(states, dtype=tf.float32)
    actions = tf.convert_to_tensor(actions, dtype=tf.int32)
    returns = tf.convert_to_tensor(returns, dtype=tf.float32)

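    with tf.GradientTape() as tape:
        probs = policy_net(states)  # shape: (episode length, number of actions)
        # Pick the probability of the action that was actually taken at each step
        action_masks = tf.one_hot(actions, depth=probs.shape[-1])
        selected_probs = tf.reduce_sum(probs * action_masks, axis=1)

        # Small constant avoids log(0)
        log_probs = tf.math.log(selected_probs + 1e-8)

        # Negative sign because the optimizer minimizes, while we want to maximize the return
        loss = -tf.reduce_mean(log_probs * returns)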

    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
    
# Train for up to 2000 episodes; every 25 episodes, evaluate the greedy policy

for episode in range(2000):

    states, actions, rewards = generate_episode(policy_net)
    returns = compute_returns(rewards, gamma)

    reinforce_update(policy_net, states, actions, returns)

    if episode % 25 == 0:
        # evaluate greedy policy
        total = 0
        for _ in range(10):
            s,_ = env.reset()
            ep_reward = 0
            while True:
                s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
                probs = policy_net(s_tensor)[0].numpy()
                a = np.argmax(probs)
                s, r, terminated, truncated,_ = env.step(a)
                ep_reward += r
                if terminated or truncated:
                    break
            total += ep_reward

        avg = total / 10
        print(f"Episode {episode}, Avg Reward: {avg}")

        if avg >= 500:
            print("==== Solved! ====")
            break

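
One commonly tried variation, in the spirit of the experimentation encouraged above, is to standardize the returns of an episode before the update, which tends to reduce the variance of the gradient estimate. The sketch below is an assumption about how this could look, not part of the exercise solution; `normalize_returns` and `eps` are hypothetical names.

```python
import numpy as np

def normalize_returns(returns, eps=1e-8):
    """Shift returns to zero mean and scale to unit standard deviation (hypothetical helper)."""
    returns = np.asarray(returns, dtype=np.float64)
    # eps guards against division by zero when all returns are equal.
    return (returns - returns.mean()) / (returns.std() + eps)

# Toy returns (assumed values) for a three-step episode.
normalized = normalize_returns([3.0, 2.0, 1.0])
print(normalized)
```

In the training loop this would be applied to the output of `compute_returns` before calling `reinforce_update`. Whether it helps on ```CartPole-v1``` is worth checking experimentally.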

### Exercise 2 - Deep Deterministic Policy Gradient (DDPG)

**Summary:** Implement the DDPG algorithm and use it to solve the ```Pusher-v4``` environment. If the
physics do not behave as expected, you might have to explicitly install mujoco version 2.3.0.


**Provided Code:** Feel free to re-use code from previous exercises. Below I have provided you with
an implementation of soft weight updates using Keras.


**Your Tasks in this exercise:**
1. Implement DDPG
2. Solve the ```Pusher-v4``` environment.
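
One building block of DDPG is the regression target for the critic, $y = r + \gamma (1 - d)\, Q'(s', \mu'(s'))$, where $Q'$ and $\mu'$ are the target critic and target actor and $d$ marks terminal transitions. The NumPy sketch below only illustrates that formula; `td_targets` and its arguments are hypothetical names, and the next-state Q-values are passed in as plain numbers rather than computed by networks.

```python
import numpy as np

def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """Critic targets y = r + gamma * (1 - done) * Q'(s', mu'(s')) (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    next_q = np.asarray(next_q_values, dtype=np.float64)
    # For terminal transitions (done = 1) the bootstrap term is dropped.
    return rewards + gamma * (1.0 - dones) * next_q

# Toy batch (assumed values): the second transition is terminal.
targets = td_targets(rewards=[1.0, 0.5], dones=[0.0, 1.0],
                     next_q_values=[2.0, 3.0], gamma=0.9)
print(targets)
```

Here the first target is 1.0 + 0.9 * 2.0 = 2.8, while the second is just the reward 0.5. In a full implementation the mask would typically use the environment's `terminated` flag only, since a time-limit truncation does not mean the future value is zero.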
    

In [None]:
def update_target_weights(source, target, tau=0.99):
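    ''' Performs a soft update of the target network:
        target <- tau * target + (1 - tau) * source
        Here tau weights the target, so a tau close to 1 means the target changes slowly.
        This is the reverse of the convention used in our previous implementation, which followed the DDPG paper.
    '''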
    for i in range(len(source.layers)):

        layer_weights_list_source = source.layers[i].get_weights()
        layer_weights_list_target = target.layers[i].get_weights()

        new_weights = []
        for w_src, w_target in zip(layer_weights_list_source, layer_weights_list_target):
            w_target = tau * w_target + (1.0 - tau) * w_src
            new_weights.append(w_target)

        target.layers[i].set_weights(new_weights)
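
To see what this update rule does numerically, the following NumPy-only sketch applies the same formula to plain arrays instead of Keras layers. `soft_update_arrays` is a hypothetical stand-in for illustration, not the function above, and the all-ones and all-zeros weights are assumed toy values.

```python
import numpy as np

def soft_update_arrays(source_weights, target_weights, tau=0.99):
    """Return tau * target + (1 - tau) * source for each pair of arrays (hypothetical helper)."""
    return [tau * w_tgt + (1.0 - tau) * w_src
            for w_src, w_tgt in zip(source_weights, target_weights)]

# Toy weights: the source is all ones, the target starts at all zeros.
source = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

# A single update moves the target 1% of the way towards the source.
one_step = soft_update_arrays(source, target, tau=0.99)

# Repeated updates let the target approach the source geometrically.
for _ in range(100):
    target = soft_update_arrays(source, target, tau=0.99)

print(one_step[0][0, 0], target[0][0, 0])
```

After one update every target entry is 0.01, and after n updates it is 1 - 0.99^n, so the target network lags smoothly behind the trained network instead of jumping to it.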