----

# Off-Policy Monte Carlo Prediction 

Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode
basis, using extensions of the techniques described in Chapter 2 (Section 2.4).
Whereas in Chapter 2 we averaged rewards, in Monte Carlo methods we average returns.
In all other respects, exactly the same methods as used in Chapter 2 can be used for
on-policy Monte Carlo methods. For off-policy Monte Carlo methods, we need to separately
consider those that use ordinary importance sampling and those that use weighted
importance sampling.
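
To make the distinction concrete, the following is a minimal sketch of the two incremental averages for a single state, assuming we already have a list of returns and their importance-sampling weights. The function names and the sample numbers are illustrative assumptions, not part of the original notebook. Ordinary importance sampling keeps a plain running mean of the scaled returns `W * G`, while weighted importance sampling keeps a running mean of `G` weighted by `W`.

```python
def ordinary_is_incremental(returns, weights):
    """Ordinary importance sampling: running mean of the scaled returns W * G."""
    value = 0.0
    for n, (g, w) in enumerate(zip(returns, weights), start=1):
        value += (w * g - value) / n
    return value


def weighted_is_incremental(returns, weights):
    """Weighted importance sampling: running weighted mean of the returns G."""
    value = 0.0       # current estimate
    cumulative = 0.0  # cumulative sum of the weights seen so far
    for g, w in zip(returns, weights):
        if w == 0:
            continue  # a zero-weight return leaves the estimate unchanged
        cumulative += w
        value += (w / cumulative) * (g - value)
    return value


# Illustrative data (made up for this sketch).
returns = [1.0, -1.0, 0.0, 1.0]
weights = [2.0, 0.5, 0.0, 4.0]

# Batch versions of the same two estimators, for comparison.
ordinary_batch = sum(w * g for w, g in zip(weights, returns)) / len(returns)
weighted_batch = sum(w * g for w, g in zip(weights, returns)) / sum(weights)

ordinary = ordinary_is_incremental(returns, weights)
weighted = weighted_is_incremental(returns, weights)
```

Both incremental versions should agree with their batch counterparts up to floating-point error, which is the property that lets the estimates be updated one episode at a time without storing past returns.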

Off-policy Monte Carlo control methods use one of the techniques presented in the
preceding two sections. They follow the behavior policy while learning about and
improving the target policy. These techniques require that the behavior policy has a
nonzero probability of selecting all actions that might be selected by the target policy
(coverage). To explore all possibilities, we require that the behavior policy be soft (i.e.,
that it select all actions in all states with nonzero probability).
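
The coverage and softness conditions can be checked mechanically when both policies are available as probability tables. The sketch below assumes each policy is a plain dictionary mapping every state to a dictionary of action probabilities that lists every available action. That representation, and the helper names `has_coverage` and `is_soft`, are assumptions made for illustration only.

```python
def has_coverage(target_probs, behavior_probs):
    """True if the behavior policy can take every action the target policy can."""
    for state, action_probs in target_probs.items():
        for action, p_target in action_probs.items():
            p_behavior = behavior_probs.get(state, {}).get(action, 0.0)
            if p_target > 0 and p_behavior <= 0:
                return False
    return True


def is_soft(policy_probs):
    """True if every listed action has nonzero probability in every state."""
    return all(
        p > 0
        for action_probs in policy_probs.values()
        for p in action_probs.values()
    )


# Hypothetical two-state, two-action example.
target = {"s0": {0: 0.0, 1: 1.0}, "s1": {0: 1.0, 1: 0.0}}
uniform = {"s0": {0: 0.5, 1: 0.5}, "s1": {0: 0.5, 1: 0.5}}
never_1_in_s0 = {"s0": {0: 1.0, 1: 0.0}, "s1": {0: 0.5, 1: 0.5}}

uniform_covers = has_coverage(target, uniform)
restricted_covers = has_coverage(target, never_1_in_s0)
```

A soft behavior policy always satisfies coverage, whatever the target policy is. The converse does not hold: a behavior policy can cover one particular target policy without being soft.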

<img src="images/off-policy-mc-prediction.png" width="1000" height="580" >


----

In [1]:
import numpy as np

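def off_policy_monte_carlo_prediction(env, target_policy, behavior_policy, num_episodes, gamma=1.0):
    """Estimate Q for target_policy from episodes generated by behavior_policy,
    using incremental weighted importance sampling."""
    # Q holds the action-value estimates; C holds the cumulative importance weights.
    Q = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}
    C = {state : {action : 0.0 for action in env.action_space} for state in env.state_space}

    for _ in range(num_episodes):
        # Generate one episode by following the behavior policy.
        episode = []
        state = env.reset()
        action = behavior_policy(state)
        while True:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state
            action = behavior_policy(state)

        # Walk the episode backwards, accumulating the return G and the weight W.
        G = 0.0
        W = 1.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Extend the importance-sampling ratio to cover this step.
            W = W * (target_policy.prob(state, action) / behavior_policy.prob(state, action))
            # Once the weight is zero, earlier steps cannot change the estimates.
            if W == 0:
                break
    return Q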

In [4]:
from custom_classes import CustomBlackjackEnv, SoftPolicy, DeterministicPolicy
env = CustomBlackjackEnv()

# Behavior policy: random
b = SoftPolicy(env, {state : [1/env.action_space.n for _ in range(env.action_space.n)] for state in env.state_space})
# Target policy: stick if sum of cards >= 20, hit otherwise
pi = DeterministicPolicy(env, {state : int(state[0] >= 20)  for state in env.state_space})

Q = off_policy_monte_carlo_prediction(env, pi, b, num_episodes=100_000)
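
The `custom_classes` module is not shown here, so the exact definitions of `SoftPolicy` and `DeterministicPolicy` are unknown. Judging only from how they are used above (called with a state to sample an action, and queried with `.prob(state, action)`), a minimal stand-in might look like the following sketch. Everything in it is a hypothetical reconstruction and may differ from the real classes.

```python
import random


class SoftPolicy:
    """Hypothetical stand-in: stochastic policy built from per-state probability lists."""

    def __init__(self, env, probs):
        self.env = env      # kept only to mirror the constructor call used above
        self.probs = probs  # state -> list of action probabilities

    def prob(self, state, action):
        return self.probs[state][action]

    def __call__(self, state):
        actions = list(range(len(self.probs[state])))
        return random.choices(actions, weights=self.probs[state])[0]


class DeterministicPolicy:
    """Hypothetical stand-in: policy that always takes one fixed action per state."""

    def __init__(self, env, actions):
        self.env = env          # kept only to mirror the constructor call used above
        self.actions = actions  # state -> chosen action

    def prob(self, state, action):
        return 1.0 if self.actions[state] == action else 0.0

    def __call__(self, state):
        return self.actions[state]


# Tiny illustration with a single made-up state "s" and two actions.
b_demo = SoftPolicy(None, {"s": [0.5, 0.5]})
pi_demo = DeterministicPolicy(None, {"s": 1})

# Per-step importance-sampling ratio for the action the target policy prefers.
ratio = pi_demo.prob("s", 1) / b_demo.prob("s", 1)
```

With a uniform behavior policy over two actions, each step on which the behavior policy happens to match the deterministic target multiplies `W` by 2, and any mismatch sets `W` to 0 and ends the backward pass for that episode.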


In [7]:
Q

{(4, 1, False): {0: -0.7272727272727272, 1: -0.6363636363636364},
 (4, 1, True): {0: 0.0, 1: 0.0},
 (4, 2, False): {0: -0.4782608695652173, 1: 0.0},
 (4, 2, True): {0: 0.0, 1: 0.0},
 (4, 3, False): {0: -0.41176470588235303, 1: -0.23076923076923078},
 (4, 3, True): {0: 0.0, 1: 0.0},
 (4, 4, False): {0: -0.49999999999999994, 1: 0.7777777777777778},
 (4, 4, True): {0: 0.0, 1: 0.0},
 (4, 5, False): {0: -0.33333333333333337, 1: -0.4736842105263157},
 (4, 5, True): {0: 0.0, 1: 0.0},
 (4, 6, False): {0: 0.0, 1: -0.33333333333333337},
 (4, 6, True): {0: 0.0, 1: 0.0},
 (4, 7, False): {0: -0.39999999999999997, 1: -0.7777777777777778},
 (4, 7, True): {0: 0.0, 1: 0.0},
 (4, 8, False): {0: -0.1724137931034482, 1: -0.8666666666666666},
 (4, 8, True): {0: 0.0, 1: 0.0},
 (4, 9, False): {0: -0.7777777777777778, 1: -0.4},
 (4, 9, True): {0: 0.0, 1: 0.0},
 (4, 10, False): {0: -0.45999999999999996, 1: -0.6595744680851063},
 (4, 10, True): {0: 0.0, 1: 0.0},
 (5, 1, False): {0: -0.7777777777777776, 1: -0.66