## Environment Settings (do not change)

Please do **not** change this part.

In [None]:
!pip install gymnasium
!pip install gymnasium[other]
!pip install gymnasium[toy-text]

In [4]:
import numpy as np
from typing import List, Tuple
import gymnasium as gym

from gymnasium.wrappers import RecordVideo
from base64 import b64encode
from IPython.display import HTML

def render_mp4(videopath: str) -> str:
  """
  Gets a string containing a base64-encoded version of the MP4 video
  at the specified path.
  """
  with open(videopath, 'rb') as f:
    mp4 = f.read()
  base64_encoded_mp4 = b64encode(mp4).decode()
  return f'<video width=400 controls><source src="data:video/mp4;' \
         f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

## Basics

Setup environment:

In [5]:
# Discount factor (in [0,1))
gamma = 0.95

# Simulation
n_episodes = 200
max_length_episode = 100

# Decide whether to generate a video or not
generate_video = True # leave it to True (or delete the rendering)
video_dir = 'vid'
video_name = 'FrozenLake'
n_episodes_video = 2 # number of episodes that end up in the video

# Environment
env = gym.make('FrozenLake-v1', map_name="4x4", is_slippery=True, render_mode='rgb_array')

# The tutorial will be on FrozenLake, but feel free to play with other environments too
#env = gym.make('Taxi-v3', render_mode='rgb_array')
#env = gym.make('CliffWalking-v0', render_mode='rgb_array')

States and actions (for the environment FrozenLake, 4x4):

*   The state is the position on the 4x4 grid (i.e., between 0 and 15);
*   The action is left (0), down (1), right (2), up (3).
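
To make the encoding above concrete, here is a minimal sketch that converts between a flat state index and a (row, column) cell. It assumes the row-major numbering used by the default 4x4 map; the helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of gymnasium.

```python
from typing import Tuple

N_COLS = 4  # width of the 4x4 map (assumption: default FrozenLake layout)
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}

def state_to_cell(state: int, n_cols: int = N_COLS) -> Tuple[int, int]:
  """Convert a flat state index into (row, column), assuming row-major order."""
  return divmod(state, n_cols)

def cell_to_state(row: int, col: int, n_cols: int = N_COLS) -> int:
  """Inverse of state_to_cell."""
  return row * n_cols + col

print(state_to_cell(7))      # (1, 3): second row, last column
print(cell_to_state(3, 3))   # 15: bottom-right cell
print(ACTION_NAMES[1])       # down
```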




In [6]:
env.action_space

Discrete(4)

In [7]:
env.action_space.n

np.int64(4)

In [8]:
env.observation_space

Discrete(16)

In [9]:
env.observation_space.n

np.int64(16)

Transition probabilities: for instance, here are the possible transitions when action 1 (down) is played at state 3.

In [10]:
env.unwrapped.P[3][1] # each entry: (probability, next state, reward, terminated)

[(0.3333333333333333, 2, 0.0, False),
 (0.3333333333333333, 7, 0.0, True),
 (0.3333333333333333, 3, 0.0, False)]
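
As a quick sanity check, the list printed above can be unpacked by hand. The sketch below hard-codes those three tuples (so it runs without an environment) and verifies that the probabilities sum to one; the variable names are illustrative only.

```python
# Transition list copied from the output above: (probability, next state, reward, terminated)
transitions = [
  (1/3, 2, 0.0, False),
  (1/3, 7, 0.0, True),
  (1/3, 3, 0.0, False),
]

# The probabilities of all possible next states must sum to one
total_probability = sum(p for p, _, _, _ in transitions)

# Expected immediate reward for this state-action pair
expected_reward = sum(p * r for p, _, r, _ in transitions)

# States that can be reached, and those that end the episode
next_states = sorted(s for _, s, _, _ in transitions)
terminal_next_states = [s for _, s, _, done in transitions if done]

print(total_probability, expected_reward, next_states, terminal_next_states)
```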

## Our First Policy: A Random Policy

We start with the semantics: for us, a policy is a Python list of length $n_\mathrm{states}$, and each entry of this list is a vector of size $n_\mathrm{actions}$. To sample an action at a given state, which will be helpful later, the following function can be used.

*Note*: this is not the most efficient way, just for illustration purposes.

In [11]:
def sample_action(policy: List[np.ndarray], state: int) -> int:
  if isinstance(state, tuple):
    state = state[0]
  return np.random.choice(np.arange(start=0, stop=env.action_space.n), size=1, p=policy[state])[0]

We start with a simple random policy. We will later experiment with other policies.

For us, a policy is a list of arrays. In particular:
- `pi` is a list whose length is the number of states;
- `pi[s]` is a numpy array whose length is the number of actions and which represents a probability distribution over the action space;
- `pi[s][a]` is the probability of playing action `a` when at state `s`.
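
A policy in this format can be checked mechanically. The following is a minimal sketch, assuming the list-of-arrays convention described above; `is_valid_policy` is a hypothetical helper, not something the notebook defines elsewhere.

```python
import numpy as np
from typing import List

def is_valid_policy(policy: List[np.ndarray], n_actions: int) -> bool:
  """Check that every entry is a probability distribution over the actions."""
  for distribution in policy:
    if distribution.shape != (n_actions,):
      return False
    if np.any(distribution < 0) or not np.isclose(distribution.sum(), 1.0):
      return False
  return True

# A uniform policy for 16 states and 4 actions passes the check ...
uniform = [np.full(4, 0.25) for _ in range(16)]
print(is_valid_policy(uniform, n_actions=4))

# ... while a vector that does not sum to one fails it
broken = [np.array([0.5, 0.5, 0.5, 0.0])]
print(is_valid_policy(broken, n_actions=4))
```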




In [12]:
pi_random = []
for x in range(env.observation_space.n):
  probability_actions = np.ones(env.action_space.n)/env.action_space.n
  pi_random.append(probability_actions)

Check your result by inspecting the probability distribution at the first state.  

In [13]:
pi_random[0]

array([0.25, 0.25, 0.25, 0.25])

Finally, we test our sampling method.

In [14]:
sample_action(policy=pi_random,
              state=1)

np.int64(1)

## Simulation environment
We now write a function that simulates a policy.

In [15]:
def simulate_environment(env, policy:List[np.ndarray], sim_video_name: str) -> float:
  # Setup video
  if generate_video:
    env = RecordVideo(env, video_dir, name_prefix=sim_video_name, episode_trigger=lambda e: e < n_episodes_video)

  total_reward = 0.0

  for e in range(n_episodes):
    reward_episode = 0.0

    # Reset (gymnasium returns the observation together with an info dict)
    observation, _ = env.reset()

    # Simulate an episode
    for t in range(max_length_episode):
      action = sample_action(policy=policy,
                             state=observation)
      observation, reward, terminated, truncated, _ = env.step(action)

      # Accumulate the discounted reward
      reward_episode += gamma**t * reward

      if terminated or truncated:
        break

    # Increase reward
    total_reward += reward_episode

  env.close()

  return total_reward/n_episodes
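
The quantity accumulated per episode is the discounted return $\sum_t \gamma^t r_t$. As a small worked example, the sketch below computes it for a hand-written reward sequence; `discounted_return` is a hypothetical helper and the reward sequences are made up for illustration.

```python
def discounted_return(rewards, gamma):
  """Sum of gamma**t * r_t over the time steps t = 0, 1, 2, ..."""
  return sum(gamma**t * r for t, r in enumerate(rewards))

# A reward of 1 collected at the third step (t = 2) is worth 0.95**2
g = discounted_return([0.0, 0.0, 1.0], gamma=0.95)
print(g)

# Assumption for illustration: a reward of 1 collected at the sixth step (t = 5)
g_six_steps = discounted_return([0.0] * 5 + [1.0], gamma=0.95)
print(g_six_steps)
```

Since the return is discounted, the averages reported by the simulation are not success rates: reaching the goal later yields a smaller contribution.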

We can now simulate our random policy. The simulation parameters are listed at the top of the file.

In [16]:
average_reward_random = simulate_environment(
    env,
    policy=pi_random,
    sim_video_name=video_name + '_random_policy')
print('Average reward: ' + str(average_reward_random))

Average reward: 0.0024383748955776477


Display video:

In [17]:
for episode in range(n_episodes_video):
    vid = HTML(render_mp4(f'{video_dir}/{video_name}_random_policy-episode-{episode}.mp4'))
    display(vid)

## Value function of a policy

Now, we can write a function that computes the value/cost-to-go of a given policy. We will do it in three ways:

1.   Solving the (linear) Bellman equation:
$$V^\pi(x)=\sum_{u}\pi(x,u)\left(R_x^u+\gamma\sum_{x'}P_{xx'}^uV^\pi(x')\right);$$

2.   Using the contractivity of the Bellman operator: update $V_t^\pi$ to $V_{t+1}^\pi$ via
$$V_{t+1}^\pi(x)=\sum_u\pi(x,u)\left(R_x^u+\gamma\sum_{x'}P_{xx'}^uV_t^\pi(x')\right);$$
note that you also need to define an initial condition and an adequate stopping criterion;

3.   Via numerical simulations.

*Note*: This is not the optimal value/cost-to-go, but just the reward/cost incurred when using the policy $\pi$. After that, we will look into methods to compute the optimal policy $\pi^\ast$.

First, let's write a function that computes the expected reward $R_x^u$ and the probability vector $P_{xx'}^u$ when playing action $u$ at state $x$.






In [None]:
def get_reward_probability_vector_state_action(state: int, action: int) -> Tuple[float, np.ndarray]:
  expected_reward = 0.0
  probability_vector = np.zeros(env.observation_space.n)

  # Extract info
  output = env.unwrapped.P[state][action]
  for o in output:
    p_next_state = o[0]
    next_state = o[1]
    reward = o[2]

    # Reward
    expected_reward += p_next_state*reward

    # Probability vector
    probability_vector[next_state] += p_next_state

  return expected_reward, probability_vector

We can now start with 1. We split the task in two pieces:


1.   Implement a function `get_reward_probability_matrix` that outputs the reward vector (numpy array whose dimension is the number of states) whose entry $x$ is
$$\sum_{u}\pi(x,u)R_x^u$$
and the probability matrix (square two-dimensional numpy array with one row and one column per state) whose entry $(x,x')$ is
$$\sum_{u}\pi(x,u)P_{xx'}^u.$$
2.   Use these two quantities to solve the linear equation. The output should be $V(x)$ as a numpy array (whose dimension is the number of states).

*Hint:* Use the function `get_reward_probability_vector_state_action` you wrote above.
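
For step 2, it may help to write the Bellman equation in matrix form. As a sketch, denote by $r^\pi$ the reward vector and by $P^\pi$ the probability matrix from step 1 (this notation is introduced here only for illustration), and let $\gamma$ be the discount factor. Stacking the values $V^\pi(x)$ into a vector gives

$$V^\pi = r^\pi + \gamma P^\pi V^\pi \quad\Longleftrightarrow\quad (I-\gamma P^\pi)\,V^\pi = r^\pi.$$

Since every row of $P^\pi$ sums to one and $\gamma<1$, the matrix $I-\gamma P^\pi$ is invertible, so the linear system has a unique solution.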



In [None]:
def get_reward_probability_matrix(policy: List[np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
  expected_reward = np.zeros(env.observation_space.n)
  probability_matrix = np.zeros((env.observation_space.n, env.observation_space.n))

  for state in range(env.observation_space.n):
    for action in range(env.action_space.n):
      # Probability of playing that action
      p_action = policy[state][action]
      # Reward and probability vector
      expected_reward_x_u, p_vector_x_u = get_reward_probability_vector_state_action(state=state,
                                                                                     action=action)
      # Reward
      expected_reward[state] += p_action*expected_reward_x_u
      # Probability matrix
      probability_matrix[state, :] += p_action*p_vector_x_u
  return expected_reward, probability_matrix

def compute_value_policy(policy: List[np.ndarray]) -> np.ndarray:
  r, p = get_reward_probability_matrix(policy=policy)
  return np.linalg.solve(np.eye(env.observation_space.n) - gamma*p, r)

value_random_policy = compute_value_policy(pi_random)
print(value_random_policy)

We now use 2.

*Hint:* You can start with a zero initial condition.

In [None]:
def compute_value_policy_iterative(policy: List[np.ndarray]) ->  np.ndarray:
  value_pi = np.zeros(env.observation_space.n)
  for t in range(1000):
    value_pi_new = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
      for action in range(env.action_space.n):
        expected_reward_action, probability_vector = get_reward_probability_vector_state_action(state=state,
                                                                                                action=action)
        value_pi_new[state] += policy[state][action]*(expected_reward_action + gamma*np.dot(probability_vector, value_pi))
    # Check if to stop
    if np.max(np.abs(value_pi - value_pi_new)) <= 1e-6:
      print('The iteration converged at t=' + str(t+1) + '.\n')
      break
    value_pi = value_pi_new.copy()
  return value_pi_new

value_iterative_random_policy = compute_value_policy_iterative(pi_random)
print(value_iterative_random_policy)
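
Both approaches should agree up to the stopping tolerance. The sketch below illustrates this on a hypothetical two-state chain (the matrix `P` and the vector `r` are made up and unrelated to FrozenLake), comparing the direct linear solve with the fixed-point iteration.

```python
import numpy as np

gamma = 0.95

# Hypothetical two-state chain under a fixed policy:
# state 0 yields reward 1 and moves to state 1 with probability 0.5,
# state 1 is absorbing and yields no reward
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
r = np.array([1.0, 0.0])

# Method 1: solve (I - gamma * P) V = r directly
v_direct = np.linalg.solve(np.eye(2) - gamma * P, r)

# Method 2: iterate V <- r + gamma * P V from a zero initial condition
v_iter = np.zeros(2)
for _ in range(10_000):
  v_next = r + gamma * P @ v_iter
  converged = np.max(np.abs(v_next - v_iter)) <= 1e-10
  v_iter = v_next
  if converged:
    break

print(v_direct)
print(v_iter)
```

For this chain the value of state 0 can also be computed by hand: $V(0) = 1 + 0.95\cdot 0.5\cdot V(0)$, so $V(0) = 1/0.525$.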

Finally, we compare the value at the first cell with the numerical estimate obtained from the simulation above.

In [None]:
# Value at the starting cell (note: for other environments it might not be the first cell)
print(value_random_policy[0])
# Average discounted reward from the simulation above
print(average_reward_random)

## Optimal policies

Now, we look into methods to find the optimal policy. In the lecture, you learned about the two main algorithms: Value iteration and policy iteration.

First, we need a function that computes the greedy policy. It consists of two ingredients:

*   A function that evaluates
$$x\mapsto\max_{u\in U}\left(R_x^u + \gamma\sum_{x'} P_{xx'}^u V(x')\right)$$

*   A second function that computes
$$\arg\max_{\pi}\sum_{u}\pi(x,u)\left(R_x^u + \gamma\sum_{x'} P_{xx'}^u V(x')\right)$$
Here, note that the maximum is always attained by a deterministic policy.
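
The reason is that the objective is linear in $\pi(x,\cdot)$: it is a weighted average of the per-action values, and an average can never exceed its largest term. The sketch below checks this numerically for hypothetical per-action values `q_values` (made-up numbers) against randomly drawn stochastic policies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values R_x^u + gamma * sum_x' P_xx'^u V(x') for three actions at one state
q_values = np.array([0.1, 0.5, 0.3])

# Draw 1000 random probability distributions over the three actions
random_policies = rng.dirichlet(np.ones(3), size=1000)

# Value of each stochastic policy: a weighted average of q_values
mixed_values = random_policies @ q_values

# Deterministic greedy policy: all probability mass on the best action
greedy = np.zeros(3)
greedy[np.argmax(q_values)] = 1.0

print(mixed_values.max() <= greedy @ q_values)
```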


In [None]:
def compute_bellman_operator(state: int, value_function: np.ndarray) -> Tuple[float, int]:
  candidates = np.zeros(env.action_space.n)
  for action in range(env.action_space.n):
    expected_reward_action, probability_vector = get_reward_probability_vector_state_action(state=state,
                                                                                            action=action)
    candidates[action] = expected_reward_action + gamma*np.dot(probability_vector, value_function)
  return np.max(candidates), np.argmax(candidates) # since it is a reward, we maximize

def compute_greedy_policy(value_function: np.ndarray) -> List[np.ndarray]:
  pi_greedy = []
  for state in range(env.observation_space.n):
    _,  best_action = compute_bellman_operator(state=state,
                                               value_function=value_function)
    stochastic_policy = np.zeros(env.action_space.n) # here we could also focus on deterministic policies
    stochastic_policy[best_action] = 1.0
    pi_greedy.append(stochastic_policy)
  return pi_greedy

### Value Iteration

We now perform value iteration. We run the algorithm for at most a maximum number of iterations and we stop when the difference between the value functions of consecutive steps (measured via $\|\cdot\|_\infty$) is smaller than some given tolerance.
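
As a side note on what this stopping rule guarantees: because the Bellman operator is a $\gamma$-contraction in $\|\cdot\|_\infty$, the distance to the optimal value function $V^\ast$ can be bounded by the last update. Writing $V_t$ for the iterate at step $t$ (notation introduced here for illustration),

$$\|V_{t+1}-V^\ast\|_\infty \le \gamma\|V_t-V_{t+1}\|_\infty+\gamma\|V_{t+1}-V^\ast\|_\infty
\quad\Longrightarrow\quad
\|V_{t+1}-V^\ast\|_\infty\le\frac{\gamma}{1-\gamma}\|V_{t+1}-V_t\|_\infty.$$

For example, with $\gamma=0.95$ and a tolerance of $10^{-5}$, the bound evaluates to $\frac{0.95}{0.05}\cdot 10^{-5}=1.9\cdot 10^{-4}$, so the tolerance on consecutive iterates is not the same as the accuracy of the final value function.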

In [None]:
# Maximum number of iterations
max_number_iterations = 1000
tol = 1e-5

# Initial guess for the value function
value = np.zeros(env.observation_space.n)

for t in range(max_number_iterations):
  value_new = np.zeros(env.observation_space.n)
  for state in range(env.observation_space.n):
    value_new[state], _ = compute_bellman_operator(state=state,
                                                   value_function=value)
  # Check if to stop
  if np.max(np.abs(value - value_new)) <= tol:
    print('Value iteration converged at t=' + str(t+1) + '.\n')
    break
  value = value_new.copy()

# Final result
value_value_iteration = value_new
pi_value_iteration = compute_greedy_policy(value_function=value_value_iteration)

# Print value at the starting cell
print(value_value_iteration[0])

Simulate policy

In [None]:
average_reward_value_iteration = simulate_environment(env,
                                                      policy=pi_value_iteration,
                                                      sim_video_name=video_name + '_value_iteration')
print('Average reward: ' + str(average_reward_value_iteration))

Display video

In [None]:
for episode in range(n_episodes_video):
    vid = HTML(render_mp4(f'{video_dir}/{video_name}_value_iteration-episode-{episode}.mp4'))
    display(vid)

### BONUS: Policy Iteration

If you have time left, try to implement policy iteration to obtain an optimal policy.

We can now implement policy iteration. We initialize the algorithm with the random policy and stop at convergence (or when a given number of iterations is reached).

In [None]:
# Maximum number of iterations
max_number_iterations = 100 # will converge in finitely many steps anyway
tol = 1e-5

# Initialize with random policy
pi = pi_random.copy()
value = compute_value_policy(policy=pi)

for t in range(max_number_iterations):
  pi_new = compute_greedy_policy(value_function=value)
  value_new = compute_value_policy(policy=pi_new)
  # Check if converged (we compare the value since the pi^ast might not be unique)
  if np.max(np.abs(value  - value_new)) <= tol:
    print('Policy iteration converged at t=' + str(t+1) + '.\n')
    break
  # Update policy
  pi = pi_new.copy()
  value = value_new.copy()

# Final result
pi_policy_iteration = pi
value_policy_iteration = compute_value_policy(policy=pi_policy_iteration)

# Print value at the starting cell
print(value_policy_iteration[0])

Simulate policy

In [None]:
average_reward_policy_iteration = simulate_environment(env,
                                                       policy=pi_policy_iteration,
                                                       sim_video_name=video_name + '_policy_iteration')
print('Average reward: ' + str(average_reward_policy_iteration))

Display video

In [None]:
for episode in range(n_episodes_video):
    vid = HTML(render_mp4(f'{video_dir}/{video_name}_policy_iteration-episode-{episode}.mp4'))
    display(vid)