In [1]:
import numpy as np
import gym
from gym import wrappers

env.step(action): Executes an action in the environment and returns the next observation, reward, a boolean done (indicating if the episode has ended), and additional info.

This code\
obs, reward, done, _ = env.step(int(policy[obs]))\
does the following:\
The policy selects an action based on the current observation obs, which is cast to an integer before being passed to env.step. The environment executes the action and returns:
1) The next observation obs.
2) The reward for the action reward.
3) Whether the episode has ended (done).
4) Additional info (ignored in this code).

In [2]:
def run_episode(env, policy, gamma = 1.0, render = True):
    """ Evaluates policy by using it to run an episode and finding its
    total reward.
    args:
    env: gym environment.
    policy: the policy to be used.
    gamma: discount factor.
    render: boolean to turn rendering on/off.
    returns:
    total reward: real value of the total reward recieved by agent under policy.
    """
    obs = env.reset() # Resets the environment to an initial state and returns the initial observation.
    total_reward = 0 
    step_idx = 0
    while True:
        if render:
            env.render() # A flag to control whether the environment's visualization is rendered during the episode.
        obs, reward, done , _ = env.step(int(policy[obs])) # Here, policy[obs] indicates the action for a given observation obs
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward

In [3]:
def evaluate_policy(env, policy, gamma = 1.0,  n = 100):
    """ Evaluates a policy by running it n times.
    returns:
    average total reward

    I.e., here we calculate the expected reward of the policy
    """
    scores = [run_episode(env, policy, gamma = gamma, render = False) for _ in range(n)]
    return np.mean(scores)

In the code below a temporary array q_sa is initialized to store the Q-values for each possible action in state s.\

env.P[s][a]: Contains a list of possible outcomes (state transitions) for taking action a in state s. Each outcome is represented as a tuple (p, s_, r, done):
- p: Probability of the transition.
- s_: The next state after the transition.
- r: Reward for the transition.
- done: Boolean indicating if the episode ends.\

The value of q_sa[a] is updated by adding the expected reward:
- Immediate reward: r.
- Discounted value of the next state: gamma * v[s_].
- Weighted by the probability of the transition: p.

In [4]:
def extract_policy(env, v, gamma = 1.0):
    """ Extract the policy given a value-function """
    policy = np.zeros(env.nS) # The number of states in the environment. 
    for s in range(env.nS):
        q_sa = np.zeros(env.action_space.n)
        for a in range(env.action_space.n): # Loops through all possible actions a in the current state s
            for next_sr in env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        policy[s] = np.argmax(q_sa)
    return policy

Using the Bellman equation, we update the value of $Q$ and $V$ using the following principle:
$$Q^{(n+1)}(x,u) = \gamma P_1^{u} V^{(n)}(x) + r(x)$$
and 
$$ V^{(n+1)}(x) = \sup_{u \in U} Q^{(n+1)}(x,u).$$

In [5]:
def value_iteration(env, gamma = 1.0):
    """ Value-iteration algorithm """
    v = np.zeros(env.nS)  # initialize value-function
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.nS):
            q_sa = [sum([p*(r + prev_v[s_]) for p, s_, r, _ in env.P[s][a]]) for a in range(env.nA)] 
            v[s] = max(q_sa)
        if (np.sum(np.fabs(prev_v - v)) <= eps):
            print ('Value-iteration converged at iteration# %d.' %(i+1))
            break
    return v

In OpenAI Gym, environments are often wrapped with wrappers to extend their functionality. These wrappers modify or augment the behavior of the base environment. For example:

- Observation Wrappers: Modify the observations returned by the environment.
- Reward Wrappers: Transform the rewards.
- Action Wrappers: Change the action space or map actions to specific formats.
- Monitor Wrappers: Log episodes or render visuals.

Example:
The TimeLimit wrapper adds a maximum time step limit to the "CartPole-v1" environment. By calling env.unwrapped, we access the base CartPoleEnv without the time limit imposed by the wrapper.

In [8]:
env_name  = 'FrozenLake8x8-v0'
gamma = 1.0
env = gym.make(env_name)
env=env.unwrapped
optimal_v = value_iteration(env, gamma)
policy = extract_policy(env, optimal_v, gamma)
policy_score = evaluate_policy(env, policy, gamma, n=1000)
print('Policy average score = ', policy_score)

DeprecatedEnv: Environment version v0 for `FrozenLake8x8` is deprecated. Please use `FrozenLake8x8-v1` instead.