# Week 9 - AI Lab

Author: Khushee Kapoor

Registration Number: 200968052

To start, we import the gym and numpy libraries.

In [None]:
# importing the libraries
import gym
import numpy as np

Next, we create the Cliff Walking environment by calling gym.make with the environment id 'CliffWalking-v0'.

In [None]:
# creating the environment
env = gym.make('CliffWalking-v0')
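As a quick orientation, the sketch below shows how the 48 states of the 4x12 Cliff Walking grid map to (row, column) cells, assuming the usual row-by-row numbering with the start in the bottom-left corner and the goal in the bottom-right corner. The helper names to_state and to_cell are illustrative and not part of gym.

```python
N_ROWS, N_COLS = 4, 12  # assumed grid size of CliffWalking-v0

def to_state(row, col):
    """Convert a (row, col) cell into a state index, counting row by row."""
    return row * N_COLS + col

def to_cell(state):
    """Convert a state index back into its (row, col) cell."""
    return divmod(state, N_COLS)

START = to_state(3, 0)                           # bottom-left corner
GOAL = to_state(3, 11)                           # bottom-right corner
CLIFF = [to_state(3, c) for c in range(1, 11)]   # cells between start and goal

print(START, GOAL, to_cell(GOAL))
```

Under this numbering the start is state 36 and the goal is state 47, which matches the 48 rows of the Q tables built later.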


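The code below defines a function monte_carlo_es, which is intended to implement the Monte Carlo ES (Exploring Starts) algorithm for the Cliff Walking environment. As written, it draws every action uniformly at random, so the Q values it learns are those of the uniform random policy.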

The input to the function is the OpenAI Gym environment env and the number of episodes n_episodes for which the algorithm should run.

Inside the function, we initialize the state-action value function Q and the visit count N to zero. We also set the discount factor gamma to 1.0, which implies that we are considering undiscounted episodes.

In each episode, we reset the environment to the starting state and generate an episode. Exploring starts would begin each episode from a randomly chosen state-action pair and follow the current policy afterwards. The code instead always begins at the environment's start state and draws a uniformly random action at every step, which also ensures that every state-action pair can be visited. We collect the sequence of (state, action, reward) tuples obtained during the episode in a list called episode.
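To illustrate what an exploring start looks like, here is a minimal sketch that does not use gym. The step function is a hand-written, hypothetical stand-in for the Cliff Walking dynamics (assumed: reward -1 per move, reward -100 and a return to the start for stepping into the cliff). The names generate_episode_es and safe_policy are also illustrative and do not appear in the notebook.

```python
import random

N_ROWS, N_COLS = 4, 12
START, GOAL = 36, 47
CLIFF = set(range(37, 47))
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # up, right, down, left

def step(state, action):
    """Hypothetical stand-in for env.step on the Cliff Walking grid."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)   # stay inside the grid
    col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = row * N_COLS + col
    if next_state in CLIFF:
        return START, -100, False                # assumed: fall sends agent to start
    return next_state, -1, next_state == GOAL

def generate_episode_es(policy, rng, max_steps=1000):
    """Random first (state, action) pair, then follow the given policy."""
    candidates = [s for s in range(N_ROWS * N_COLS)
                  if s != GOAL and s not in CLIFF]
    state = rng.choice(candidates)               # exploring start: random state
    action = rng.randrange(4)                    # exploring start: random action
    episode = []
    for _ in range(max_steps):                   # cap guards against endless loops
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
        action = policy[state]                   # after the start, follow the policy
    return episode

# A simple hand-made policy: go up from the bottom row, right otherwise,
# and down along the last column until the goal is reached.
safe_policy = []
for s in range(N_ROWS * N_COLS):
    row, col = divmod(s, N_COLS)
    if col == N_COLS - 1:
        safe_policy.append(2)
    elif row == N_ROWS - 1:
        safe_policy.append(0)
    else:
        safe_policy.append(1)

episode = generate_episode_es(safe_policy, random.Random(0))
print(len(episode), episode[0])
```

The step cap matters because a deterministic policy can cycle forever on a grid; the value 1000 is an arbitrary choice for this sketch.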

After generating the episode, we update the state-action values using the Monte Carlo method. We walk backwards through the episode and accumulate the return for each time step, that is, the sum of the rewards from that time step until the end of the episode. For each state-action pair encountered, we increment the visit count N and move its Q value towards the return by incremental averaging. Every occurrence of a pair within an episode is used, so this is an every-visit update.

Finally, we derive a greedy policy from the Q values by selecting, for each state, the action with the highest Q value. The function returns this policy, the state-action value function Q, and the list of steps taken in each episode.

In [None]:
# Monte Carlo ES (Exploring Starts)
def monte_carlo_es(env, n_episodes=500):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    N = np.zeros((env.observation_space.n, env.action_space.n))
    gamma = 1.0
    total_steps = []
    
    for i in range(n_episodes):
        state = env.reset()
        episode = []
        done = False
        steps = 0

        # generate an episode using exploring starts
        while not done:
            action = np.random.choice(env.action_space.n)
            next_state, reward, done, info = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            steps += 1
        total_steps.append(steps)
        
        # update Q values using the episode
        returns = 0
        for j in range(len(episode)-1, -1, -1):
            state, action, reward = episode[j]
            returns = gamma*returns + reward
            N[state][action] += 1
            Q[state][action] += (returns - Q[state][action])/N[state][action]
    
    # derive optimal policy from Q values
    policy = np.argmax(Q, axis=1)
    
    return policy, Q, total_steps
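The backward pass in the update loop above can be traced by hand on a short episode. The sketch below is a standalone illustration; the function names discounted_returns and incremental_mean and the reward values are made up for the example.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every time step, computed from the last step backwards."""
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = gamma * g + reward        # same recursion as in the update loop
        returns.append(g)
    returns.reverse()
    return returns

def incremental_mean(values):
    """Running average using the update mean += (value - mean) / count."""
    mean, count = 0.0, 0
    for value in values:
        count += 1
        mean += (value - mean) / count
    return mean

rewards = [-1, -1, -100, -1]                 # illustrative rewards of one episode
print(discounted_returns(rewards))           # [-103.0, -102.0, -101.0, -1.0]
print(incremental_mean([-103.0, -13.0]))     # -58.0
```

With gamma set to 1.0, each return is simply the sum of the remaining rewards, and the incremental update gives the same result as averaging all returns observed so far.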


The code below defines a function on_policy_mc_control, which is intended to implement the on-policy first-visit Monte Carlo control algorithm with Ɛ-soft policies for the Cliff Walking environment.

The input to the function is the OpenAI Gym environment env, the number of episodes n_episodes for which the algorithm should run, and epsilon, the Ɛ parameter of the Ɛ-soft policy.

Inside the function, we initialize the state-action value function Q and the visit count N to zero. We also set the discount factor gamma to 1.0, which implies that we are considering undiscounted episodes.

In each episode, we reset the environment to the starting state and generate an episode using an Ɛ-soft policy. At each time step, we choose a random action with probability Ɛ or the greedy action (i.e., the action that maximizes the Q value) with probability 1 - Ɛ. After every step, we increment the visit count N for the state-action pair and move its Q value towards the one-step target reward + gamma*max(Q[next_state]) with step size 1/N. Note that this is a bootstrapped, Q-learning-style update applied during the episode, not an average of first-visit returns computed after the episode ends.
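The action selection described above can be sketched in isolation. The example below is a minimal, standalone illustration; the function names and the Q values are made up. Because the random branch samples from all actions, including the greedy one, the greedy action ends up with probability 1 - Ɛ + Ɛ/|A| and every other action with probability Ɛ/|A|.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # ties are broken in favour of the lowest index, like np.argmax
    return max(range(len(q_values)), key=lambda a: q_values[a])

def action_probabilities(q_values, epsilon):
    """Probability of each action under the selection rule above."""
    n_actions = len(q_values)
    greedy = max(range(n_actions), key=lambda a: q_values[a])
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1.0 - epsilon
    return probs

q_row = [-5.0, -2.0, -9.0, -4.0]             # illustrative Q values for one state
print(action_probabilities(q_row, 0.1))      # greedy action 1 gets about 0.925
print(epsilon_greedy_action(q_row, 0.1, random.Random(0)))
```

Every action keeps a probability of at least Ɛ/|A|, which is what makes the policy Ɛ-soft.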

Finally, we derive a greedy policy from the Q values by selecting, for each state, the action with the highest Q value. The function returns this policy, the state-action value function Q, and the list of steps taken in each episode.

In [None]:
# On-policy first-visit MC control (for Ɛ-soft policies), for Ɛ = 0.1
def on_policy_mc_control(env, n_episodes=500, epsilon=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    N = np.zeros((env.observation_space.n, env.action_space.n))
    gamma = 1.0
    total_steps = []
    
    for i in range(n_episodes):
        state = env.reset()
        done = False
        steps = 0
        
        # generate an episode using Ɛ-soft policy
        while not done:
            if np.random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state])
            next_state, reward, done, info = env.step(action)
            N[state][action] += 1
            Q[state][action] += (reward + gamma*np.max(Q[next_state]) - Q[state][action])/N[state][action]
            state = next_state
            steps += 1
        total_steps.append(steps)
    
    # derive optimal policy from Q values
    policy = np.argmax(Q, axis=1)
    
    return policy, Q, total_steps
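For comparison, a first-visit Monte Carlo update, as named in the heading of the cell above, could be sketched as follows. This is not the update the function above performs; it is a minimal illustration that assumes the episode has already been collected as a list of (state, action, reward) tuples. The function name first_visit_mc_update is hypothetical.

```python
def first_visit_mc_update(Q, N, episode, gamma=1.0):
    """Update Q and N in place using only the first visit of each pair."""
    # record the time step at which each (state, action) pair first occurs
    first_visit = {}
    for t, (state, action, _) in enumerate(episode):
        first_visit.setdefault((state, action), t)

    g = 0.0
    for t in range(len(episode) - 1, -1, -1):
        state, action, reward = episode[t]
        g = gamma * g + reward                   # return from time step t onwards
        if first_visit[(state, action)] == t:    # skip later repeats of the pair
            N[state][action] += 1
            Q[state][action] += (g - Q[state][action]) / N[state][action]

# tiny illustrative problem with 2 states and 2 actions
Q = [[0.0, 0.0], [0.0, 0.0]]
N = [[0, 0], [0, 0]]
episode = [(0, 1, -1), (1, 0, -1), (0, 1, -1)]   # pair (0, 1) occurs twice
first_visit_mc_update(Q, N, episode)
print(Q)   # [[0.0, -3.0], [-2.0, 0.0]]
```

The pair (0, 1) is updated once, with the return from its first occurrence (-3), even though it appears twice in the episode. The same function also works with NumPy arrays for Q and N.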

The code snippet below applies the monte_carlo_es and on_policy_mc_control functions to the Cliff Walking environment env to learn a policy with each of the two functions, using the default of 500 episodes.

The output of the monte_carlo_es function is three variables: monte_carlo_es_policy, monte_carlo_es_q, and total_steps_es.

- monte_carlo_es_policy is the learned greedy policy as an array of shape (48,), where the i-th element is the action that maximizes the Q value for state i.

- monte_carlo_es_q is the state-action value function Q as an array of shape (48,4), where the i-th row corresponds to the Q values for the i-th state and the j-th column corresponds to the Q value for taking action j in state i.

- total_steps_es is a list of length n_episodes containing the total number of steps taken in each episode during the Monte Carlo ES algorithm.

Similarly, the output of the on_policy_mc_control function is three variables: on_policy_mc_control_policy, on_policy_mc_control_q, and total_steps_control.

- on_policy_mc_control_policy is the learned greedy policy as an array of shape (48,), where the i-th element is the action that maximizes the Q value for state i.

- on_policy_mc_control_q is the state-action value function Q as an array of shape (48,4), where the i-th row corresponds to the Q values for the i-th state and the j-th column corresponds to the Q value for taking action j in state i.

- total_steps_control is a list of length n_episodes containing the total number of steps taken in each episode during the On-policy first-visit MC control algorithm.

Overall, this code snippet applies two different algorithms to learn the optimal policy for the Cliff Walking environment and stores the learned policy, Q function, and total steps taken in each episode for both algorithms.

In [None]:
# run Monte Carlo ES and On-policy first-visit MC control
monte_carlo_es_policy, monte_carlo_es_q, total_steps_es = monte_carlo_es(env)
on_policy_mc_control_policy, on_policy_mc_control_q, total_steps_control = on_policy_mc_control(env)
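A policy array of shape (48,) is easier to read when laid out on the 4x12 grid. The sketch below is one possible way to do that, assuming the action encoding 0 = up, 1 = right, 2 = down, 3 = left and row-by-row state numbering. The function name render_policy is illustrative, and the policy used here is a placeholder, not a learned one.

```python
ARROWS = {0: '^', 1: '>', 2: 'v', 3: '<'}   # assumed action encoding

def render_policy(policy, n_rows=4, n_cols=12):
    """Return the policy as a text grid; C marks the cliff and G the goal."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            state = row * n_cols + col
            if row == n_rows - 1 and col == n_cols - 1:
                cells.append('G')
            elif row == n_rows - 1 and 0 < col < n_cols - 1:
                cells.append('C')
            else:
                cells.append(ARROWS[int(policy[state])])
        lines.append(' '.join(cells))
    return '\n'.join(lines)

placeholder_policy = [1] * 48               # "always right", for illustration only
print(render_policy(placeholder_policy))
# With the notebook's results one could call, for example:
# print(render_policy(on_policy_mc_control_policy))
```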


Next, we compare the total number of steps each technique takes during training by summing the number of steps taken in each of the 500 episodes.

In [None]:
# total number of steps taken to reach optimal policy
print('Total Number of Steps taken to reach Optimal Policy using Monte Carlo ES: {}'.format(sum(total_steps_es)))
print('Total Number of Steps taken to reach Optimal Policy using On-Policy First-Visit MC Control: {}'.format(sum(total_steps_control)))

Total Number of Steps taken to reach Optimal Policy using Monte Carlo ES: 3084474
Total Number of Steps taken to reach Optimal Policy using On-Policy First-Visit MC Control: 17440



As we can see, the On-Policy First-Visit MC Control technique takes far fewer steps in total (17,440) than the Monte Carlo ES technique (3,084,474).

Similarly, we compare the average number of steps per episode for both techniques by dividing the total number of steps by the number of episodes.

In [None]:
# average number of steps per episode taken to reach optimal policy
print('Average Number of Steps per Episode taken to reach Optimal Policy using Monte Carlo ES: {}'.format(sum(total_steps_es)/len(total_steps_es)))
print('Average Number of Steps per Episode taken to reach Optimal Policy using On-Policy First-Visit MC Control: {}'.format(sum(total_steps_control)/len(total_steps_control)))

Average Number of Steps per Episode taken to reach Optimal Policy using Monte Carlo ES: 6168.948
Average Number of Steps per Episode taken to reach Optimal Policy using On-Policy First-Visit MC Control: 34.88



As we can see, the average number of steps per episode in the On-Policy First-Visit MC Control technique (34.88) is far lower than in the Monte Carlo ES technique (6168.948).

The above comparisons suggest that the On-Policy First-Visit MC Control technique learns to reach the goal much faster during training than the Monte Carlo ES technique. Training step counts measure episode length rather than the quality of the final policy, so they are indirect evidence of faster convergence and better performance.
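Totals and overall averages hide how episode length changes over the course of training. One simple way to look at this is to average the per-episode step counts over consecutive windows of episodes, as in the sketch below. The function name windowed_mean and the window size are illustrative choices.

```python
def windowed_mean(values, window=50):
    """Average values over consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means

print(windowed_mean([1, 2, 3, 4, 5], window=2))   # [1.5, 3.5, 5.0]
# With the notebook's results one could call, for example:
# print(windowed_mean(total_steps_control))
# print(windowed_mean(total_steps_es))
```

A downward trend across windows would indicate that episodes get shorter as the Q values improve, while a flat curve would indicate that episode length does not depend on what has been learned.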