<a href="https://colab.research.google.com/github/JanNogga/rl_ss25/blob/main/RL_Assignment_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Robot Learning

## Assignment 9

### Solutions are due on 24.06.2025 before the lecture.

## Task 9.1)

Recall the discussion of the corresponding algorithm from the lecture and carefully read [Apprenticeship Learning via Inverse Reinforcement Learning](https://ai.stanford.edu/~ang/papers/icml04-apprentice.pdf) [Abbeel & Ng (2004)] to answer the following questions:

* (Yes/No - No reason required): Can a policy recovered by apprenticeship learning via IRL be optimal with respect to multiple reward functions?
* (Yes/No - No reason required): Can different policies lead to the same feature expectations?
* How many parameters are learned to approximate the reward function?
* How is available expert behaviour taken into account?
* What assumptions are made regarding the RL algorithm used to return optimal policies with respect to the approximated rewards?

<div style="text-align: right; font-weight:bold"> 1 + 1 + 2 + 2 + 2 = 8 Points </div>

Please answer in this text cell.

## Task 9.2)

In the following task, we will again use an environment from Gym.

If you have started your Colab session and are ready to proceed, uncomment the four lines in the code cell below. They will install everything required to simulate the environment. If prompted to restart your runtime, do so, but you don't have to repeat the installation unless you delete your runtime.

**Warning: This is unlikely to work on your own computer, and might even mess up your system! Please only use the following lines in Colab.**

In [None]:
#!apt-get -qq install xvfb x11-utils &> /dev/null
#!pip install ufal.pybox2d --quiet &> /dev/null
#!pip install pyvirtualdisplay moviepy pyglet PyOpenGL-accelerate --quiet &> /dev/null
#!pip install numpy==1.23.5 matplotlib==3.7.0

### Introduction

In this task, we examine the Cart Pole environment.

Below, you are given an expert policy that selects the better of the two available actions in each state. Note that the environment has a continuous 4-D state space, which we discretize by partitioning the relevant regions into $8 \times 8 \times 8 \times 8$ bins. This allows the use of tabular policies, such as the expert, to look up the better of the 2 actions per state. Familiarize yourself with the corresponding discretization helper functions, the expert policy, and the environment by examining the code below, which animates the expert policy playing one episode.
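
As a minimal illustration of this kind of discretization, the sketch below bins a single state dimension with `np.digitize`. The names `edges` and `to_bin` are hypothetical and used only here; the edge layout is assumed to be 8 evenly spaced edges on $[-4.8, 4.8]$, as for the cart position.

```python
import numpy as np

# Assumed layout: 8 evenly spaced bin edges for one state dimension
edges = np.linspace(-4.8, 4.8, 8)

def to_bin(value, edges):
    # np.digitize counts how many edges are <= value; subtracting 1
    # turns that count into a zero-based bin index
    return int(np.digitize(value, edges) - 1)

print(to_bin(0.0, edges))   # a value in the middle of the range
print(to_bin(-4.8, edges))  # a value exactly on the lowest edge
print(to_bin(5.0, edges))   # a value above the highest edge
print(to_bin(-5.0, edges))  # a value below the lowest edge
```

Note that a value below the lowest edge yields the index $-1$, which NumPy interprets as the last entry along that axis when it is used to index a table.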

In [None]:
import numpy as np
import gym
import matplotlib.pyplot as plt
from pyvirtualdisplay import Display
from moviepy.editor import VideoClip
from moviepy.video.io.bindings import mplfig_to_npimage
from tqdm import tqdm
import random
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Bin settings
NBINS = 8
NACTIONS = 2
# Dimension of the CartPole-v1 state space.
ENV_STATE_DIM =  4

### Discretization helper functions
# Helper function to create the bins
def create_bins(nbins=NBINS):
    """
    create bins to discretize the continuous observable state space
    """
    # NBINS X NBINS x NBINS x NBINS for
    # [cart_pos, cart_vel, pole_angle, pole_angular_vel]

    bins = np.zeros((ENV_STATE_DIM,nbins))
    bins[0] = np.linspace(-4.8, 4.8, nbins)
    bins[1] = np.linspace(-5, 5, nbins)
    bins[2] = np.linspace(-.418, .418, nbins)
    bins[3] = np.linspace(-5, 5, nbins)
    return bins

BINS = create_bins()

# Helper function to convert an environment state to a state_discrete containing bin indices
# You can directly access numpy arrays representing tabular Q or pi using the output
def discretize_state(state, bins=BINS):
    return tuple([np.digitize(state[i], bins[i]) - 1 for i in range(ENV_STATE_DIM)])

In [None]:
# set up showing animations from the environment in Colab.
Display(visible=False).start()
# download the expert policy
!wget https://github.com/JanNogga/rl_ss25/raw/main/pi_expert.npy

In [3]:
# load the expert policy
pi_expert = np.load('pi_expert.npy')
# Name of the environment.
ENV_NAME = 'CartPole-v1'
# Cart Pole has 2 discrete actions: [Push cart to the left, Push cart to the right]
ENV_ACTION_DIM = 1
# Create the environment
env = gym.make(ENV_NAME)
# Reset the environment
state = env.reset() # state = [cart_pos, cart_vel, pole_angle, pole_angular_vel]
# Track whether the episode is over
done = False
# List to append the frames produced by the environment renderer
frames = []
while not done:
  # Render current situation and append to frames
  frames.append(env.render('rgb_array'))
  # Select an action for the discretized state according to the expert policy
  state_discrete = discretize_state(state)
  action = pi_expert[state_discrete]
  # Execute this action
  state, reward, done, info = env.step(action)
# Print the number of frames
print('Number of frames:', len(frames))
# Prevent the renderer from showing artifacts
plt.close()

Number of frames: 500


In [4]:
# Helper function to animate a list of frames as produced above
def visualize_trajectory(frames, fps=30):
  duration = int(len(frames) // fps + 1)
  fig, ax = plt.subplots()
  def make_frame(t, ind_max=len(frames)):
      ax.clear()
      ax.imshow(frames[min((int(fps*t),ind_max-1))])
      return mplfig_to_npimage(fig)
  plt.close()
  return VideoClip(make_frame, duration=duration)

In [5]:
# Get the animation from the frames of the played episode
animation = visualize_trajectory(frames)
# Show the animation
animation.ipython_display(fps=30, loop=True, autoplay=True)


Furthermore, you are given the code skeletons of several other helper functions below. Complete them by

* Implementing $$\phi: \mathcal{S} \to [0,1]^4, \quad s=(s_{(0)},s_{(1)},s_{(2)},s_{(3)}) \mapsto \phi(s) = (f(s_{(0)}),f(s_{(1)}),f(s_{(2)}),f(s_{(3)}))$$ where $f(x) = 1/(1+e^{-x})$ is the sigmoid function.

* Implementing *feature_expectation()* which converts a policy represented by $\pi(s)$ or $Q^{\pi}(s,a)$ to $\mu_{\pi}$. Hint: refer to eq. (5) in [Abbeel & Ng (2004)].

* Completing the RL-algorithm (standard Q-learning) to use reward estimates instead of the rewards obtained from the environment. For the sake of the exercise assume that all we know about the environment is based on observing the expert policy perform.
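
The quantities named in the list above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the intended solution: `sigmoid`, `empirical_feature_expectation` and `linear_reward` are hypothetical helpers, the trajectories are made up rather than sampled from the environment, and the vectors have shape `(4,)` instead of the `(4,1)` shape used in the skeleton. The estimate averages the discounted feature sums $\sum_t \gamma^t \phi(s_t)$ over trajectories, and the reward estimate is assumed to be linear in the features, $w^\top \phi(s)$.

```python
import numpy as np

def sigmoid(x):
    # Elementwise logistic function f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def empirical_feature_expectation(trajectories, gamma=0.9):
    # trajectories: list of arrays of shape (T_i, 4), one row per visited state
    # Returns the average over trajectories of sum_t gamma^t * sigmoid(s_t)
    mu = np.zeros(4)
    for states in trajectories:
        states = np.asarray(states, dtype=float)
        discounts = gamma ** np.arange(len(states))
        mu += discounts @ sigmoid(states)
    return mu / len(trajectories)

def linear_reward(w, state):
    # Reward estimate assumed linear in the features: w^T phi(s)
    return float(np.dot(w, sigmoid(np.asarray(state, dtype=float))))

# Synthetic data (not from the environment): two short all-zero trajectories
toy_trajectories = [np.zeros((3, 4)), np.zeros((2, 4))]
mu_hat = empirical_feature_expectation(toy_trajectories, gamma=0.5)
print(mu_hat)
print(linear_reward(np.ones(4), np.zeros(4)))
```

In the assignment itself, the trajectories would come from rolling out the given policy in the environment, and the number of rollouts trades off estimation noise against run time.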

<div style="text-align: right; font-weight:bold"> 2 + 3 + 2 = 7 Points </div>

In [None]:
def phi(state):
    # in: a (4,) numpy array containing the continuous env state
    # out: a (4,1) numpy array containing the feature vector phi(s)
    return TODO

# You don't need to change this function
def selectActionEpsGreedy(Q, state_discrete, eps = 0.1):
    # in: Q values, a discrete_state and eps
    # out: action selected by the policy induced by Q
    if np.random.uniform() < eps:
        a = np.random.choice(NACTIONS)
    else:
        a = np.argmax(Q[state_discrete])
    return a

def feature_expectation(PiorQ, num_iterations=1000, gamma=0.9):
    # in: Q, an 8x8x8x8x2 numpy array OR
    # in: pi, an 8x8x8x8 numpy array
    # out: the (4,1) feature expectation for the input policy
    return TODO

# You only need to add a few lines here, look for TODO
def rl_algorithm(w=None, num_iterations=10000, gamma=0.9, alpha=0.02):
    # in: weights w to approximate rewards
    # out: Q values learned for w, and the corresponding learning curve
    Q = np.zeros((NBINS, NBINS, NBINS, NBINS, NACTIONS))
    returns_hist = np.zeros(num_iterations)
    for i in tqdm(range(num_iterations)):
        # reset the environment
        # do this before each new episode
        state_prev = env.reset()
        state_prev_discrete = discretize_state(state_prev)
        # done is used to indicate the end of an episode
        done = False
        # evaluate the episode
        k = 0
        while not done:
            k += 1
            # select an eps-greedy action; eps decays with the episode index i
            eps = 0.8/np.sqrt(1+i)
            a = selectActionEpsGreedy(Q, state_prev_discrete, eps = eps)
            # execute action a using the step function
            state, reward, done, _ = env.step(action = a)
            # at the end of the episode, returns_hist[i] is the return of episode i
            # you might need this for debugging, so don't overwrite the reward before this step
            returns_hist[i] += reward
            if w is not None:
                # we consider the task solved at 200 steps
                if k > 200:
                    done = True
                # TODO, overwrite the reward using w and phi
                reward = TODO
            state_discrete = discretize_state(state)
            # get the greedy action a_prime (eps = 0)
            a_prime = selectActionEpsGreedy(Q, state_discrete, eps = 0)
            # catch terminal states
            if done:
                Q_next = 0.
            else:
                Q_next = Q[state_discrete][a_prime]
            # update the Q-values, make sure you have overwritten reward by now
            Q[state_prev_discrete][a] += alpha*(reward + gamma * Q_next - Q[state_prev_discrete][a])
            # backup the current state
            state_prev_discrete = state_discrete
    return Q, returns_hist

## Task 9.3)

Use your results from the previous task to implement Apprenticeship Learning via Inverse Reinforcement Learning to estimate a reward function and find a policy which is optimal with respect to it based on observations of the expert. For ease of implementation, refer to section 3.1 in [Abbeel & Ng (2004)] to implement the algorithm using the projection method. Plot the learning curve of your final student policy and animate its behavior!

Hint: This algorithm is relatively straightforward to implement if you can get past the somewhat confusing cases for the epoch index $i$. In [Abbeel & Ng (2004)], section 3, the process is partitioned into 6 core steps. Step 1 is only used for initialization, and $i$ is set from 0 to 1 immediately after. Step 2 behaves differently for $i=1$, where $w$, $\overline{\mu}$ and $t$ append their first list elements as described towards the end of sec. 3.1. For $i\geq2$, step 2 proceeds according to the beginning of section 3.1.
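
The update of the projection method can be sketched on toy vectors as follows. This is a minimal sketch under assumptions: `projection_step` is a hypothetical helper, the 2-D vectors stand in for real feature expectations, and in the actual algorithm `mu_1` would be obtained by running the RL algorithm with `w_1` and evaluating the resulting policy.

```python
import numpy as np

def projection_step(mu_E, mu_bar_prev, mu_new):
    # Orthogonal projection of mu_E onto the line through mu_bar_prev and mu_new
    # (assumes mu_new differs from mu_bar_prev, otherwise this divides by zero)
    d = mu_new - mu_bar_prev
    step = np.dot(d, mu_E - mu_bar_prev) / np.dot(d, d)
    mu_bar = mu_bar_prev + step * d
    w = mu_E - mu_bar          # new reward weights
    t = np.linalg.norm(w)      # distance to the expert's feature expectation
    return mu_bar, w, t

# Toy stand-ins for feature expectations (made up for illustration)
mu_E = np.array([1.0, 1.0])   # expert
mu_0 = np.array([0.0, 0.0])   # initial policy
mu_1 = np.array([2.0, 0.0])   # policy found for w_1

# i = 1: mu_bar is initialized to mu_0, and w, t get their first entries
mu_bar_0 = mu_0
w_1 = mu_E - mu_bar_0
t_1 = np.linalg.norm(w_1)

# i >= 2: project, then update w and t
mu_bar_1, w_2, t_2 = projection_step(mu_E, mu_bar_0, mu_1)
print(mu_bar_1, w_2, t_1, t_2)
```

The loop would typically stop once `t` falls below a chosen tolerance; the tolerance value is a design choice not fixed by this sketch.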

<div style="text-align: right; font-weight:bold"> 5 Points </div>

In [None]:
# Your code can go here