----

# Monte Carlo Control (On-Policy)
Monte Carlo Control is a method used in reinforcement learning to find the optimal policy for decision-making problems. Unlike policy evaluation, which estimates the value of a given policy, Monte Carlo Control focuses on improving the policy itself based on the returns received from episodes. This method does not require a model of the environment and uses the concept of exploration and exploitation to balance between trying out new actions and sticking with the best-known actions.

## Why use $Q(s,a)$ instead of $V(s)$ in Monte Carlo Methods?

1. Policy imporvment and optimization: The fundamental goal of reinforcement learning is to find an optimal policy. While $V(s)$ provides the value of being in a state under a certain policy, it does not directly inform the agent about which action to take. $Q(s,a)$, on the other hand, directly evaluates the potential of each action, making it more straightforward to select the best action without needing a model of the environment. This is particularly useful in Monte Carlo Control methods for policy improvement, where the policy is made greedy with respect to the $Q$ values.


2. Dealing with model-free environments: Moreover, Monte Carlo methods are model-free, meaning they do not require knowledge of the environment's dynamics (i.e., the transition probabilities and rewards). While both $V$ and $Q$ functions can be estimated without a model, the $Q$ function is more conducive to model-free control. This is because once $Q(s,a)$ is known, the agent can directly decide the best action without needing to know the transition probabilities to the next states, which would be required to use $V(s)$ effectively for control.

<img src="images/mc-es-first-visit.png" width="1000" height="580" >

In the following code cell we implement this algorithm. 



----

In [1]:
import numpy as np

def first_visit_monte_carlo_es(env, gamma=1.0, num_episodes=10_000):

    # We will use Python dictionaries to represent the policy, Q and returns.
    # The policy is a mapping from state to action; initially, it is a random policy.
    pi = {state : np.random.choice([action for action in env.action_space]) for state in env.state_space}

    # Q is a mapping from state-action pair to expected return.
    Q = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}

    # returns is a mapping from state-action pair to a list of returns.
    returns = {state : {action : [] for action in env.action_space} for state in env.state_space}

    for _ in range(num_episodes):

        # Generate an episode.
        episode = []

        # Starting from a random state and action, after which the policy is followed.
        state = env.state_space.sample()
        action = env.action_space.sample()
        env.reset(observation=state)
        next_state, reward, terminated = env.step(action)
        episode.append((state, action, reward))
        state = next_state

        # Follow the policy.
        while not terminated:
            action = pi[state]
            next_state, reward, terminated = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Update Q and pi. This is the first-visit MC method.
        # G is the expected return from the current state onwards.
        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma*G + reward
            if (state, action) not in [(x[0], x[1]) for x in episode[:t]]:
                returns[state][action].append(G)
                Q[state][action] = np.mean(returns[state][action])
                pi[state] = np.argmax(Q[state])

    return pi

In [2]:
# Import the custom blackjack environment
from custom_classes import CustomBlackjackEnv
env = CustomBlackjackEnv()

pi = first_visit_monte_carlo_es(env, num_episodes=500_000)

In [3]:
from mc_utils import win_percentage
num_games = 100_000

################################################################################
# Random policy win percentage over 100_000 games.

num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = env.action_space.sample()
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1
win_percentage = num_wins / num_games * 100
print(f"Random policy win percentage over {num_games} games: {win_percentage:.1f}%")

################################################################################
# Learned policy win percentage over 100_000 games.
num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = pi[observation]
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1
win_percentage = num_wins / num_games * 100
print(f"Policy exploring starts win percentage over {num_games} games: {win_percentage:.1f}%")

Random policy win percentage over 100000 games: 28.1%
Policy exploring starts win percentage over 100000 games: 38.5%


In [4]:
def on_policy_first_visit_monte_carlo_control(env, gamma=1.0, epsilon=0.1, num_episodes=10_000):

    # We will use Python dictionaries to represent the policy, Q and returns.
    # epsilon-soft policy.
    pi = {state : np.ones(env.action_space.n)/env.action_space.n for state in env.state_space}

    # Q is a mapping from state-action pair to expected return.
    Q = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}

    # returns is a mapping from state-action pair to a list of returns.
    returns = {state : {action : [] for action in env.action_space} for state in env.state_space}

    action_space = [a for a in env.action_space]

    for _ in range(num_episodes):

        # Generate an episode.
        episode = []
        state = env.reset()
        action = np.random.choice(action_space, p=pi[state])
        next_state, reward, terminated = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        while not terminated:
            action = np.random.choice(action_space, p=pi[state])
            next_state, reward, terminated = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma*G + reward
            if (state, action) not in [(x[0], x[1]) for x in episode[:t]]:
                returns[state][action].append(G)
                Q[state][action] = np.mean(returns[state][action])
                A_star = np.argmax(Q[state])
                for a in env.action_space:
                    pi[state][a] = 1 - epsilon + epsilon / env.action_space.n if a == A_star else epsilon / env.action_space.n

    return pi

In [5]:
pi = on_policy_first_visit_monte_carlo_control(env,epsilon=0.0025, num_episodes=500_000)

In [117]:
# Learned policy win percentage over 100_000 games.
num_games = 100_000
num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = np.random.choice([a for a in env.action_space], p=pi[observation])
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1

win_percentage = num_wins / num_games * 100
print(f"Win percentage over {num_games} games: {win_percentage:.1f}%")

Win percentage over 100000 games: 38.2%
