----

## Monte Carlo Control without Exploring Starts 

**Question.** How can we avoid the unlikely assumption of exploring starts? Without it, the only general way to ensure that all actions are selected infinitely often is for the agent to keep selecting them.

**Solutions.** On-policy methods and off-policy methods.

### On-Policy Methods
On-policy methods attempt to evaluate or improve the policy that is being used to make decisions, whereas off-policy methods evaluate or improve a policy different from the one used to generate the data. With Monte Carlo methods, the data are the episodes generated by the policy. The Monte Carlo ES method above is an on-policy method; we now discuss an on-policy method that does not assume exploring starts. Off-policy methods are considered in the next section.

* On-policy control methods typically use *soft policies*, meaning that $\pi(a \mid s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$.

The most popular soft policies are the *$\epsilon$-greedy* policies: most of the time they choose an action with maximal estimated action value, but with probability $\epsilon$ they instead select an action uniformly at random (recall such policies from the bandit problems covered in Sutton & Barto, Chapter 2).

* ($\implies$) All nongreedy actions are given the minimal probability of selection $\frac{\epsilon}{|\mathcal{A}(s)|}$. 

* ($\implies$) The greedy action is given the remaining bulk of the probability, $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$.
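
The two probabilities above can be checked numerically. The following is a minimal sketch, not the notebook's own implementation: the helper name `epsilon_greedy_probs` and the example action values are hypothetical, and ties are broken by taking the first maximal action. It builds the $\epsilon$-greedy distribution for one state and confirms that the probabilities sum to one.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for a single state.

    q_values: estimated action values, one entry per action (hypothetical input).
    epsilon:  probability of choosing an action uniformly at random.
    """
    q_values = np.asarray(q_values, dtype=float)
    n = q_values.size
    # Every action, greedy or not, receives the minimal probability epsilon / n.
    probs = np.full(n, epsilon / n)
    # The greedy action (first maximiser on ties) receives the remaining 1 - epsilon.
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# Example with three actions; the second action has the largest estimated value.
probs = epsilon_greedy_probs([0.2, 0.7, 0.1], epsilon=0.1)
print(probs, probs.sum())
```

Every entry is at least $\frac{\epsilon}{|\mathcal{A}(s)|}$, so the resulting policy is soft.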

The code implementing **on-policy first-visit Monte Carlo control (for $\epsilon$-soft policies), which estimates $\pi \approx \pi_{*}$**, is in cell `In [5]` below; the cells before it run Monte Carlo ES on the same environment for comparison.

----

In [2]:
import numpy as np

def first_visit_monte_carlo_es(env, gamma=1.0, num_episodes=10_000):

    # We will use Python dictionaries to represent the policy, Q and returns.
    # The policy is a mapping from state to action; initially, it is a random policy.
    pi = {state : np.random.choice([action for action in env.action_space]) for state in env.state_space}

    # Q is a mapping from state-action pair to expected return.
    Q = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}

    # returns is a mapping from state-action pair to a list of returns.
    returns = {state : {action : [] for action in env.action_space} for state in env.state_space}

    for _ in range(num_episodes):

        # Generate an episode.
        episode = []

        # Starting from a random state and action, after which the policy is followed.
        state = env.state_space.sample()
        action = env.action_space.sample()
        env.reset(observation=state)
        next_state, reward, terminated = env.step(action)
        episode.append((state, action, reward))
        state = next_state

        # Follow the policy.
        while not terminated:
            action = pi[state]
            next_state, reward, terminated = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Update Q and pi. This is the first-visit MC method.
        # G is the sampled return (discounted sum of rewards) from time step t onwards.
        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma*G + reward
            if (state, action) not in [(x[0], x[1]) for x in episode[:t]]:
                returns[state][action].append(G)
                Q[state][action] = np.mean(returns[state][action])
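                # Q[state] is a dict, so np.argmax would not compare its values;
                # pick the action key with the largest estimated value instead.
                pi[state] = max(Q[state], key=Q[state].get)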

    return pi

In [3]:
# Import the custom blackjack environment
from custom_classes import CustomBlackjackEnv
env = CustomBlackjackEnv()

pi = first_visit_monte_carlo_es(env, num_episodes=500_000)

In [4]:
num_games = 100_000

################################################################################
# Random policy win percentage over 100_000 games.

num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = env.action_space.sample()
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1
win_percentage = num_wins / num_games * 100
print(f"Random policy win percentage over {num_games} games: {win_percentage:.1f}%")

################################################################################
# Learned policy win percentage over 100_000 games.
num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = pi[observation]
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1
win_percentage = num_wins / num_games * 100
print(f"Policy exploring starts win percentage over {num_games} games: {win_percentage:.1f}%")

Random policy win percentage over 100000 games: 28.4%
Policy exploring starts win percentage over 100000 games: 38.5%


In [5]:
def on_policy_first_visit_monte_carlo_control(env, gamma=1.0, epsilon=0.1, num_episodes=10_000):

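    # We will use Python dictionaries to represent the policy, Q and returns.
    # The policy maps each state to a vector of action probabilities.
    # It starts uniform, which is epsilon-soft for any epsilon <= 1.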
    pi = {state : np.ones(env.action_space.n)/env.action_space.n for state in env.state_space}

    # Q is a mapping from state-action pair to expected return.
    Q = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}

    # returns is a mapping from state-action pair to a list of returns.
    returns = {state : {action : [] for action in env.action_space} for state in env.state_space}

    action_space = [a for a in env.action_space]

    for _ in range(num_episodes):

        # Generate an episode.
        episode = []
        state = env.reset()
        action = np.random.choice(action_space, p=pi[state])
        next_state, reward, terminated = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        while not terminated:
            action = np.random.choice(action_space, p=pi[state])
            next_state, reward, terminated = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        G = 0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma*G + reward
            if (state, action) not in [(x[0], x[1]) for x in episode[:t]]:
                returns[state][action].append(G)
                Q[state][action] = np.mean(returns[state][action])
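                # Q[state] is a dict, so np.argmax would not compare its values;
                # pick the action key with the largest estimated value instead.
                A_star = max(Q[state], key=Q[state].get)
                # Assumes actions are the integers 0..n-1, since they index pi[state].
                for a in env.action_space:
                    pi[state][a] = 1 - epsilon + epsilon / env.action_space.n if a == A_star else epsilon / env.action_space.n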

    return pi

In [6]:
pi = on_policy_first_visit_monte_carlo_control(env,epsilon=0.0025, num_episodes=500_000)

In [117]:
# Learned policy win percentage over 100_000 games.
num_games = 100_000
num_wins = 0
for _ in range(num_games):
    observation = env.reset()
    terminated = False
    while not terminated:
        action = np.random.choice([a for a in env.action_space], p=pi[observation])
        observation, reward, terminated = env.step(action)
    if reward == 1:
        num_wins += 1

win_percentage = num_wins / num_games * 100
print(f"Win percentage over {num_games} games: {win_percentage:.1f}%")

Win percentage over 100000 games: 38.2%
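
Both control functions above accumulate returns by walking backwards through the episode and recording $G$ only at the first visit of each state-action pair. The sketch below isolates that step so it can be checked by hand. The helper name `first_visit_returns` and the toy episode are hypothetical and not part of the notebook; it precomputes the first occurrence of each pair instead of rescanning the episode prefix at every step, which should give the same result with less work on long episodes.

```python
def first_visit_returns(episode, gamma=1.0):
    """Return {(state, action): G} for the first visit of each pair in one episode.

    episode: list of (state, action, reward) tuples, in the order they occurred.
    """
    # Time step at which each (state, action) pair first occurs.
    first_index = {}
    for t, (state, action, _) in enumerate(episode):
        first_index.setdefault((state, action), t)

    returns = {}
    G = 0.0
    # Walk backwards so that G is the discounted reward sum from step t onwards.
    for t in range(len(episode) - 1, -1, -1):
        state, action, reward = episode[t]
        G = gamma * G + reward
        # Record G only if step t is the first visit of this pair.
        if first_index[(state, action)] == t:
            returns[(state, action)] = G
    return returns

# Toy episode: the pair ("s0", 1) occurs twice, so only its first visit is recorded.
episode = [("s0", 1, 0.0), ("s1", 1, 0.0), ("s0", 1, 0.0), ("s2", 0, 1.0)]
result = first_visit_returns(episode, gamma=0.5)
print(result)
```

With `gamma=0.5` the single reward of 1 at the last step is halved at each earlier step, so the first visit of `("s0", 1)` at step 0 receives a return of 0.125 rather than the 0.5 seen at its second visit.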
