# Lecture: Implementation of Policy Iteration

We want to implement policy iteration for the Frozen Lake environment (`FrozenLake-v1`) provided by Gymnasium. Gymnasium is a reinforcement learning (RL) library that provides a collection of pre-built environments for training and evaluating RL algorithms. It is a maintained fork of the original OpenAI Gym, which became unmaintained after OpenAI shifted focus.

The environments follow a uniform API, which makes it easy to test different RL algorithms on various tasks.

If you run the code locally and have pygame installed (as listed in `requirements.txt`), choosing `render_mode='human'` opens a window that shows the environment and its dynamics.

In [None]:
!git clone https://github.com/Fjoelsak/RL.git
!cp RL/03_Dynamic_Programming/mdp_control.py ./

In [None]:
import gymnasium as gym

env = gym.make('FrozenLake-v1',
               desc = None,
               map_name = "4x4",
               is_slippery = False,
               render_mode = 'human')


This is an example use of the provided environment in which the agent takes a randomly sampled action at each step.

In [None]:
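# reset() returns the initial observation and an info dict
obs, _ = env.reset()

while True:
    env.render()
    # sample a random action from the discrete action space
    action = env.action_space.sample()

    obs, reward, terminated, truncated, _ = env.step(action)

    # stop once the episode has terminated or was truncated
    if terminated or truncated:
        break

env.close()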

# Exercise 1: Getting to know the environment

Go to the [Farama Foundation docs of the environment](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) and determine how the state and action spaces are defined, how the reward function is implemented, and which condition terminates an episode. In addition, check what the `is_slippery` boolean does.

In [None]:
env.observation_space

In [None]:
env.action_space
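Both spaces are discrete, so states and actions are plain integers. As a small illustration of how to read them: the FrozenLake documentation describes the observation as `current_row * ncols + current_col` and lists the actions as 0 (left), 1 (down), 2 (right), 3 (up). The sketch below assumes the 4x4 map used above; the helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of Gymnasium.

```python
# Assumption: the 4x4 map, i.e. 16 states laid out in 4 rows and 4 columns.
N_ROWS, N_COLS = 4, 4

# Action encoding as listed in the Gymnasium FrozenLake docs.
ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def state_to_cell(state):
    """Convert a flat state index into (row, col) on the grid (hypothetical helper)."""
    return divmod(state, N_COLS)


def cell_to_state(row, col):
    """Convert (row, col) back into the flat state index (hypothetical helper)."""
    return row * N_COLS + col


print(state_to_cell(0))     # (0, 0): top-left cell
print(state_to_cell(15))    # (3, 3): bottom-right cell
print(cell_to_state(1, 2))  # 6
```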

# Exercise 2: Policy Iteration for the Frozen Lake environment

Check the class `mdpControl` in `mdp_control.py` and implement the functions `policy_evaluation()` and `policy_iteration()`.

To obtain the transition probabilities explicitly, you can use the unwrapped Frozen Lake environment. `env.unwrapped.P` maps each state (0-15) and each action (0-3) to a list of tuples `(transition probability, next_state, reward, done)`, where the `done` flag indicates whether the episode terminates.

In [None]:
env.unwrapped.P
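To see how this dictionary can drive policy iteration, here is a minimal, standalone sketch on a hand-built three-state chain stored in the same `P[state][action] -> [(prob, next_state, reward, done), ...]` layout. Everything in it is an assumption for illustration: the toy MDP `P_toy`, the function names, the discount `gamma=0.9`, and the threshold `theta`. It does not reproduce the signatures expected by `mdpControl` in `mdp_control.py`.

```python
# Hypothetical toy MDP: states 0 - 1 - 2 in a chain, state 2 is the terminal goal.
# Actions: 0 = left, 1 = right. Entering the goal yields reward 1.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected return of taking action a in state s."""
    total = 0.0
    for prob, s_next, reward, done in P[s][a]:
        # Do not bootstrap from the value of a terminal successor.
        total += prob * (reward + (0.0 if done else gamma * V[s_next]))
    return total


def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Sweep over all states until the largest value change drops below theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}  # arbitrary start: always go left
    while True:
        V = policy_evaluation(P, policy, gamma)
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            # Switch only on a strict improvement to avoid cycling between ties.
            if q_value(P, V, s, best, gamma) > q_value(P, V, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


policy, V = policy_iteration(P_toy)
print(policy)
print(V)
```

The same loops should carry over to `env.unwrapped.P` because the inner lists have the same tuple layout; with `is_slippery = True` each list simply holds more than one outcome.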

In [None]:
import mdp_control

env = gym.make('FrozenLake-v1',
               desc = None,
               map_name = "4x4",
               is_slippery = False,
               render_mode = 'human')

mdp = mdp_control.mdpControl(env)
p, V = mdp.policy_iteration()

# rendering of the agent acting in the env with your optimized policy
mdp.render_single(p)
env.close()

# Exercise 3: Value Iteration for the Frozen Lake environment

Check the class `mdpControl` in `mdp_control.py` and implement the function `value_iteration()`.
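For orientation, value iteration can be sketched on a small hand-built MDP that uses the same `P[state][action] -> [(prob, next_state, reward, done), ...]` layout as `env.unwrapped.P`. This is a standalone illustration under stated assumptions, not the `mdpControl` implementation: the toy chain `P_toy`, the function names, `gamma=0.9`, and `theta` are all hypothetical choices.

```python
# Hypothetical toy MDP: states 0 - 1 - 2 in a chain, state 2 is the terminal goal.
# Actions: 0 = left, 1 = right. Entering the goal yields reward 1.
P_toy = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected return of taking action a in state s."""
    return sum(
        prob * (reward + (0.0 if done else gamma * V[s_next]))
        for prob, s_next, reward, done in P[s][a]
    )


def value_iteration(P, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Back up the best action value instead of following a fixed policy.
            v_new = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return policy, V


policy, V = value_iteration(P_toy)
print(policy)
print(V)
```

Compared with policy iteration, there is no separate evaluation phase here: the maximum over actions is taken inside every sweep, and the policy is extracted only once at the end.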

In [None]:
import mdp_control

env = gym.make('FrozenLake-v1',
               desc = None,
               map_name = "4x4",
               is_slippery = False,
               render_mode = 'human')


mdp = mdp_control.mdpControl(env)
p, V = mdp.value_iteration()

# rendering of the agent acting in the env with your optimized policy
mdp.render_single(p)
env.close()