## Implementing On-policy MC control


In [1]:
# import gym
import gymnasium as gym
import pandas as pd
from collections import defaultdict
import random
import flappy_bird_gymnasium

MC control

https://github.com/sudharsan13296/Deep-Reinforcement-Learning-With-Python/blob/master/04.%20Monte%20Carlo%20Methods/4.13.%20Implementing%20On-Policy%20MC%20Control.ipynb


Using flappy_bird_gymnasium

https://github.com/markub3327/flappy-bird-gymnasium


In [2]:
import time

env = gym.make("FlappyBird-v0")

obs, _ = env.reset()
while True:
    # Next action:
    # (feed the observation to your agent here)
    action = env.action_space.sample()

    # Processing:
    obs, reward, terminated, truncated, info = env.step(action)

    # Rendering the game:
    # (remove these two lines during training)
    env.render()
    time.sleep(1 / 30)  # FPS

    # Checking if the player is still alive (or the episode was cut short)
    if terminated or truncated:
        break

env.close()

Initialize the dictionary for storing the Q values:


In [4]:
Q = defaultdict(float)

Initialize the dictionary for storing the total return of the state-action pair:


In [5]:
total_return = defaultdict(float)

Initialize the dictionary for storing the count of the number of times a state-action pair is visited:


In [6]:
N = defaultdict(int)
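
All three dictionaries are keyed by (state, action) pairs, and dictionary keys must be hashable. The observations in this environment are NumPy arrays, which are not hashable, so each one has to be converted to a tuple before it is used as a key. Here is a minimal sketch of that bookkeeping; the observation values and the return of 3.0 are made up for illustration, and a plain list stands in for the array:

```python
from collections import defaultdict

Q = defaultdict(float)
total_return = defaultdict(float)
N = defaultdict(int)

# Hypothetical observation, standing in for the array returned by the environment.
observation = [0.5, -0.25, 1.0]
action = 1

# Lists (like NumPy arrays) are unhashable, so convert to a tuple to build the key.
key = (tuple(observation), action)

# A pair that has never been seen reads as 0.0 thanks to defaultdict(float).
unseen_value = Q[key]

# Record one return of 3.0 for this pair and refresh the running average.
total_return[key] += 3.0
N[key] += 1
Q[key] = total_return[key] / N[key]
```

Using one consistent conversion for every key matters: a lookup only finds an entry if the key is built the same way as when it was stored.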

## Define the epsilon-greedy policy

We learned that we select actions based on the epsilon-greedy policy, so we define a function called epsilon_greedy_policy which takes the state and Q value as an input and returns the action to be performed in the given state:


In [7]:
def epsilon_greedy_policy(state, Q):
    # set the epsilon value to 0.5
    epsilon = 0.5

    # Sample a value uniformly from [0, 1]. If it is less than epsilon, select a random
    # action (explore); otherwise select the action with the maximum Q value (exploit).
    # Note: state[0] assumes `state` is the (observation, info) tuple returned by
    # env.reset(), not a bare observation array.

    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        return max(
            list(range(env.action_space.n)), key=lambda x: Q[(tuple(state[0]), x)]
        )

In [8]:
def epsilon_greedy_policy_(state, Q):
    epsilon = 0.5

    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    else:
        state_tuple = tuple(state.tolist())  # Convert the NumPy array to a tuple
        return max(list(range(env.action_space.n)), key=lambda x: Q[(state_tuple, x)])
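
To check the selection rule without the game environment, here is a small self-contained sketch. The function name `epsilon_greedy`, its `n_actions`, `epsilon` and `rng` parameters, and the toy Q table are assumptions made for this illustration; they are not part of the notebook's code above.

```python
import random


def epsilon_greedy(state_key, Q, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # Unseen (state, action) pairs are treated as having a Q value of 0.0.
    return max(range(n_actions), key=lambda a: Q.get((state_key, a), 0.0))


# Toy Q table with a single state in which action 1 looks better than action 0.
Q_toy = {(("s0",), 0): 0.2, (("s0",), 1): 0.9}

# With epsilon = 0 the choice is purely greedy.
greedy_action = epsilon_greedy(("s0",), Q_toy, n_actions=2, epsilon=0.0)

# With epsilon = 0.5, action 0 is only chosen on exploratory draws.
rng = random.Random(0)
samples = [epsilon_greedy(("s0",), Q_toy, 2, 0.5, rng) for _ in range(1000)]
explore_share = samples.count(0) / len(samples)
```

With two actions and epsilon = 0.5, the non-greedy action should be chosen on roughly a quarter of the draws: half the draws explore, and half of those land on action 0.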

## Generating an episode

Now, let's generate an episode using the epsilon-greedy policy. We define a function called generate_episode which takes the Q value as an input and returns the episode as a list of (state, action, reward) tuples:


In [9]:
num_timesteps = 100

In [10]:
def generate_episode(Q):
    # initialize a list for storing the episode
    episode = []

    # initialize the state using the reset function; reset() returns an
    # (observation, info) pair, so keep only the observation
    state, _ = env.reset()

    # then for each time step
    for t in range(num_timesteps):
        # select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy_(state, Q)

        # perform the selected action and store the next state information
        next_state, reward, terminated, truncated, info = env.step(action)

        # store the state, action, reward in the episode list
        episode.append((state, action, reward))

        # if the episode has ended (terminal state or truncation) then break the loop,
        # else update the next state to the current state
        if terminated or truncated:
            break

        state = next_state

    return episode
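
In the Gymnasium API, `reset()` returns an (observation, info) pair and `step()` returns five values: observation, reward, terminated, truncated, info. The sketch below reproduces the same loop against a tiny stand-in environment so that the shape of the returned episode can be inspected without the game. `CountdownEnv` and `generate_toy_episode` are hypothetical names invented for this example.

```python
class CountdownEnv:
    """Hypothetical stand-in with Gymnasium-style reset()/step() return shapes."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self):
        self.remaining = self.start
        return (self.remaining,), {}  # (observation, info)

    def step(self, action):
        self.remaining -= 1
        terminated = self.remaining == 0
        truncated = False
        # observation, reward, terminated, truncated, info
        return (self.remaining,), 1.0, terminated, truncated, {}


def generate_toy_episode(env, policy, max_steps=100):
    episode = []
    state, _ = env.reset()  # unpack the observation, discard info
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        if terminated or truncated:
            break
        state = next_state
    return episode


# A fixed policy that always picks action 0.
toy_episode = generate_toy_episode(CountdownEnv(start=3), policy=lambda s: 0)
```

Each entry pairs a state with the action taken in it and the reward that followed, so the final state reached after the last step is never stored.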

## Computing the optimal policy

Now, let's learn how to compute the optimal policy. First, let's set the number of iterations, that is, the number of episodes, we want to generate:


In [11]:
num_iterations = 50000

We learned that in the on-policy control method, no policy is given as an input. Instead, we start from an arbitrary policy in the first iteration and improve it iteratively by updating the Q values. Because the policy is extracted from the Q function, we don't have to define it explicitly: as the Q values improve, the policy improves implicitly. That is, in the first iteration we generate an episode using the epsilon-greedy policy extracted from the initialized Q function. Over a series of iterations, the Q estimates move toward the optimal Q function, and the policy derived from them improves accordingly.
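
To make "extracting the policy from the Q function" concrete, here is a minimal sketch that reads a greedy policy out of a Q table. The helper name `greedy_policy_from_q` and the toy values are assumptions for illustration only.

```python
def greedy_policy_from_q(Q, n_actions):
    """Map each state that appears in Q to its highest-valued action."""
    states = {state for (state, _action) in Q}
    return {
        # Pairs that were never visited are treated as having a Q value of 0.0.
        s: max(range(n_actions), key=lambda a: Q.get((s, a), 0.0))
        for s in states
    }


# Toy Q table: state "a" has both actions estimated, state "b" only action 0.
Q_toy = {
    (("a",), 0): 1.0,
    (("a",), 1): 2.0,
    (("b",), 0): -1.0,
}

policy = greedy_policy_from_q(Q_toy, n_actions=2)
```

Note the effect of the zero default: in state "b" the unvisited action 1 wins over action 0, whose estimate is negative. The same thing happens with a `defaultdict(float)` Q table.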


In [12]:
state, _ = env.reset()
max(range(env.action_space.n), key=lambda x: Q[(tuple(state.tolist()), x)])

0

In [15]:
# for each iteration
for i in range(num_iterations):
    # so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    # get all the state-action pairs in the episode, converting each state array to a
    # tuple so that the pair is hashable and can be used as a dictionary key
    all_state_action_pairs = [(tuple(s.tolist()), a) for (s, a, r) in episode]

    # store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s, a, r) in episode]

    # for each step in the episode
    for t, state_action in enumerate(all_state_action_pairs):
        # if the state-action pair is occurring for the first time in the episode
        if state_action not in all_state_action_pairs[:t]:
            # compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            # update total return of the state-action pair
            total_return[state_action] += R

            # update the number of times the state-action pair is visited
            N[state_action] += 1

            # compute the Q value by just taking the average
            Q[state_action] = total_return[state_action] / N[state_action]

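
The first-visit update in the loop above can be checked by hand on a tiny episode. The sketch below wraps the same steps in a hypothetical helper, `first_visit_update`, and applies it to two made-up episodes in which states are already tuples; like the loop, it uses the undiscounted sum of rewards as the return.

```python
from collections import defaultdict


def first_visit_update(episode, Q, total_return, N):
    """Apply one first-visit Monte Carlo update (undiscounted) from one episode."""
    pairs = [(s, a) for (s, a, _r) in episode]
    rewards = [r for (_s, _a, r) in episode]
    for t, pair in enumerate(pairs):
        if pair in pairs[:t]:
            continue  # not the first visit of this pair in the episode
        R = sum(rewards[t:])  # return from step t to the end of the episode
        total_return[pair] += R
        N[pair] += 1
        Q[pair] = total_return[pair] / N[pair]


Q_demo = defaultdict(float)
G_demo = defaultdict(float)
N_demo = defaultdict(int)
s0, s1 = (0.0,), (1.0,)

# Episode 1: (s0, 0) occurs twice; only its first occurrence (return 3.0) counts.
first_visit_update([(s0, 0, 1.0), (s1, 1, 1.0), (s0, 0, 1.0)], Q_demo, G_demo, N_demo)

# Episode 2: a single step with return 1.0, so Q[(s0, 0)] becomes (3.0 + 1.0) / 2.
first_visit_update([(s0, 0, 1.0)], Q_demo, G_demo, N_demo)
```

After both episodes, the pair `(s0, 0)` has been counted twice with returns 3.0 and 1.0, so its Q value is their average.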

In [None]:
df = pd.DataFrame(Q.items(), columns=["state_action pair", "value"])

In [None]:
df.head(11)