# Monte Carlo Methods

**Monte Carlo methods** require only experience: sample sequences of states, actions, and rewards  
from actual or simulated interaction with an environment. Learning from actual experience is striking  
because it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behavior.

Monte Carlo methods are ways of solving the reinforcement learning problem based on **averaging sample returns**.  
To ensure that well-defined returns are available, here we define Monte Carlo methods only for episodic tasks.  
That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter  
what actions are selected.

Only on the completion of an episode are value estimates and policies changed.

Monte Carlo methods can thus be incremental in an **episode-by-episode** sense, but not in a step-by-step (online) sense.  
The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a **significant random component**.

Here we use it specifically for methods based on averaging complete returns.
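To make "complete return" concrete, here is a minimal sketch that computes the discounted return $G_t$ for every step of a finished episode by working backwards from the end. The function name `discounted_returns` and the example rewards are illustrative assumptions, not part of the notebook's agents.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}], where rewards[t] is the reward received after step t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards through the episode: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical three-step episode with a single reward at the end
example_returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
print(example_returns)  # [0.25, 0.5, 1.0]
```

Because the recursion needs the reward of the final step, none of these returns can be computed before the episode has terminated.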

As with bandit methods, returns are sampled and averaged, but now there are multiple states, each acting like a different bandit problem (as in associative search, or a contextual bandit),  
and the different bandit problems are interrelated. That is, the return after taking an action in one state depends  
on the actions taken in later states in the same episode. Because all the action selections are undergoing learning,  
the problem becomes nonstationary from the point of view of the earlier state.

To handle the nonstationarity, we adapt the idea of **generalized policy iteration** (GPI).

## Monte Carlo Prediction

Monte Carlo prediction methods learn the state-value function for a given policy.  
Recall that the value of a state is the expected return—expected cumulative future discounted reward—starting from that state.
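In symbols, using the standard notation (assumed here, since the symbols are not defined in the text above) of discount factor $\gamma$, terminal time $T$, and reward $R_{t+1}$ received after step $t$:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right].
$$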

An obvious way to estimate it from experience, then, is simply to average the returns observed after visits to that state.  
As more returns are observed, the average should converge to the expected value. This idea underlies all Monte Carlo methods.
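The average does not have to be recomputed from a stored list of returns. As a minimal sketch, assuming returns arrive one at a time, the incremental update $V \leftarrow V + \frac{1}{n}(G - V)$ gives the same value as the plain sample mean. The name `incremental_mean` and the sample values are illustrative assumptions.

```python
def incremental_mean(samples):
    """Running average using V <- V + (G - V) / n, without storing past samples."""
    value, count = 0.0, 0
    for g in samples:
        count += 1
        value += (g - value) / count
    return value


# Hypothetical returns observed after four visits to one state
sample_returns = [1.0, -1.0, 1.0, 1.0]
running = incremental_mean(sample_returns)
batch = sum(sample_returns) / len(sample_returns)
print(running, batch)
```

Only the current estimate and a visit count need to be kept per state, which is why the agents in this notebook store a counter rather than a list of returns.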

The **first-visit Monte Carlo method** estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, where a first visit is the first time $s$ occurs in an episode.  
By the law of large numbers, the sequence of these averages converges to the expected value. Each average is an unbiased estimate, and the standard deviation of its error falls as $1/\sqrt{n}$,
where $n$ is the number of returns averaged.
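A minimal sketch of the first-visit rule applied to a single episode, assuming states are hashable and `rewards[t]` is the reward that follows `states[t]`. The helper `first_visit_returns` and the toy episode are hypothetical. When a state occurs more than once, only the return following its earliest occurrence is kept.

```python
def first_visit_returns(states, rewards, gamma):
    """Map each state to the return that follows its first visit in the episode."""
    first_index = {}
    for t, s in enumerate(states):
        first_index.setdefault(s, t)  # keep only the earliest time step per state

    result = {}
    g = 0.0
    for t in reversed(range(len(states))):
        g = rewards[t] + gamma * g
        if first_index[states[t]] == t:
            result[states[t]] = g
    return result


# Hypothetical episode: state "A" is visited at t=0 and again at t=2
episode_states = ["A", "B", "A"]
episode_rewards = [1.0, 0.0, 2.0]
fv = first_visit_returns(episode_states, episode_rewards, gamma=1.0)
print(fv)  # {'B': 2.0, 'A': 3.0}
```

The return of 2.0 that follows the second visit to "A" is ignored. An every-visit method would average it together with the 3.0 from the first visit.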

In [None]:
import collections

class FirstVisitMonteCarloPrediction():

    def __init__(self, gamma, policy):
        self.gamma  = gamma
        self.policy = policy

        self.state_value = collections.defaultdict(lambda: 0)
        self.returns = collections.defaultdict(lambda: 0)

        self.states = []
        self.rewards = []

    def action(self, state):
        return self.policy(state)
    
    def observe(self, state, action, reward):
        self.states.append(state)
        self.rewards.append(reward)
    
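    def optimize(self):
        # Called once per finished episode: walk backwards, accumulating the return g
        g = 0

        for t in reversed(range(len(self.states))):
            state = self.states[t]
            g = self.gamma * g + self.rewards[t]

            # First-visit check: update only if the state does not occur earlier in the episode
            if state not in self.states[0:t]:
                # self.returns counts the returns averaged so far for this state
                self.returns[state] += 1
                # Incremental mean: V <- V + (G - V) / n
                self.state_value[state] += (1 / self.returns[state]) * (g - self.state_value[state])
        
        # Clear the episode buffers
        self.states = []
        self.rewards = []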

### Blackjack

In [None]:
# Import the necessary libraries
import numpy as np
import plotly.graph_objects as go

import plotly.io as pio
pio.renderers.default = 'notebook'

import gymnasium as gym
env = gym.make('Blackjack-v1', natural=True, sab=False)

In [None]:
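def play_env(env, agent):
    # Play one episode, then let the agent learn from it
    done = False
    observation, info = env.reset()

    while not done:
        action = agent.action(observation)

        new_observation, reward, terminated, truncated, info = env.step(action)
        # An episode also ends when it is truncated (e.g. by a time limit)
        done = terminated or truncated

        agent.observe(observation, action, reward)

        observation = new_observation
    
    agent.optimize()

    # Reward of the final step (in Blackjack, the payoff of the hand)
    return reward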

In [None]:
def random_policy(state):
    return np.random.randint(low=0, high=2)  # high is exclusive: samples 0 (stick) or 1 (hit)

def stick_policy(state):
    player_score = state[0]
    return 0 if player_score in [20, 21] else 1

agent = FirstVisitMonteCarloPrediction(gamma=1, policy=stick_policy)

In [None]:
for i in range(100_000):
    play_env(env, agent)

In [None]:
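# Rows: player sum, columns: dealer's showing card; NaN where no state was visited
Z = np.full((22, 12), np.nan)

# Each state is (player_sum, dealer_card, usable_ace); states that differ only in
# usable_ace share a cell, so the last one written wins
for k in agent.state_value.keys():
    Z[k[0]][k[1]] = agent.state_value[k]

sh_0, sh_1 = Z.shape

# Integer axis values so each grid point lines up with its dealer card / player sum
x, y = np.arange(sh_1), np.arange(sh_0)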

fig = go.Figure(data=[go.Surface(z=Z, x=x, y=y)])

fig.update_layout(title='MCFirstVisit',
                  autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

## Monte Carlo Estimation of Action Values

## Monte Carlo Control

In [None]:
class MonteCarloExploringStart():

    def __init__(self, action_space, gamma, policy):
        self.gamma  = gamma
        self.policy = policy

        self.state_action_value = collections.defaultdict(lambda: np.zeros((action_space.n)))
        self.returns = collections.defaultdict(lambda: 0)

        self.actions = []
        self.states = []
        self.rewards = []

        self.first_action = True

    def action(self, state):
        if self.first_action:
            self.first_action = False
            return np.random.choice(len(self.state_action_value[state]))
        else:
            return self.policy(self.state_action_value, state)
    
    def observe(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
    
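    def optimize(self):
        g = 0
        # When estimating action values, first visits are defined on (state, action) pairs
        pairs = list(zip(self.states, self.actions))

        for t in reversed(range(len(pairs))):
            state, action = pairs[t]
            g = self.gamma * g + self.rewards[t]

            if pairs[t] not in pairs[0:t]:
                # Count returns per (state, action) pair so the step size is 1 / n for that pair
                self.returns[pairs[t]] += 1
                self.state_action_value[state][action] += (1 / self.returns[pairs[t]]) * (g - self.state_action_value[state][action])
        
        # Clear all episode buffers, including the actions
        self.states = []
        self.actions = []
        self.rewards = []

        self.first_action = True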

In [None]:
def greedy(state_action_value, state):
    return np.argmax(state_action_value[state])

def stick_policy(state_action_value, state):
    player_score = state[0]
    return 0 if player_score in [20, 21] else 1

env = gym.make('Blackjack-v1', natural=True, sab=False)
# Control requires policy improvement, so act greedily with respect to the current action values
agent = MonteCarloExploringStart(action_space=env.action_space, gamma=1, policy=greedy)

In [None]:
for i in range(100_000):
    play_env(env, agent)

In [None]:
def compute_v_from_q(state_action_value):
    # State value of the greedy policy: v(s) = max_a q(s, a)
    state_value = collections.defaultdict(float)

    for s in state_action_value:
        state_value[s] = np.max(state_action_value[s])
    
    return state_value

state_value = compute_v_from_q(agent.state_action_value)

In [None]:
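# Rows: player sum, columns: dealer's showing card; NaN where no state was visited
Z = np.full((22, 12), np.nan)

# States that differ only in usable_ace share a cell, so the last one written wins
for k in state_value.keys():
    Z[k[0]][k[1]] = state_value[k]

sh_0, sh_1 = Z.shape

# Integer axis values so each grid point lines up with its dealer card / player sum
x, y = np.arange(sh_1), np.arange(sh_0)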

fig = go.Figure(data=[go.Surface(z=Z, x=x, y=y)])

fig.update_layout(title='Monte Carlo Exploring Start',
                  autosize=False,
                  width=500, height=500,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

## Monte Carlo Control without Exploring Starts

In [None]:
class OnPolicyMonteCarloControl():

    def __init__(self, action_space, gamma, policy):
        self.gamma  = gamma
        self.policy = policy

        self.state_action_value = collections.defaultdict(lambda: np.zeros((action_space.n)))
        self.returns = collections.defaultdict(lambda: 0)

        self.actions = []
        self.states = []
        self.rewards = []

    def action(self, state):
        return self.policy(self.state_action_value, state)
    
    def observe(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
    
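    def optimize(self):
        g = 0
        # When estimating action values, first visits are defined on (state, action) pairs
        pairs = list(zip(self.states, self.actions))

        for t in reversed(range(len(pairs))):
            state, action = pairs[t]
            g = self.gamma * g + self.rewards[t]

            if pairs[t] not in pairs[0:t]:
                # Count returns per (state, action) pair so the step size is 1 / n for that pair
                self.returns[pairs[t]] += 1
                self.state_action_value[state][action] += (1 / self.returns[pairs[t]]) * (g - self.state_action_value[state][action])
        
        # Clear all episode buffers, including the actions
        self.states = []
        self.actions = []
        self.rewards = []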

In [None]:
def get_epsilon_greedy_policy(epsilon=0.1):
    def epsilon_greedy_policy(state_action_value, state):
        take_random_action_prob = np.random.uniform(0, 1)

        if take_random_action_prob < epsilon:
            random_action = np.random.randint(0, len(state_action_value[state]))
            return random_action
        else:
            greedy_action = np.argmax(state_action_value[state])
            return greedy_action
    
    return epsilon_greedy_policy
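
The sampling procedure above implies a specific probability for each action: every action receives $\varepsilon / |A|$ from the random branch, and the greedy action additionally receives $1 - \varepsilon$. A minimal sketch of that distribution, assuming ties are broken in favor of the lowest action index; `epsilon_greedy_probs` is a hypothetical helper, not something the agents in this notebook call.

```python
def epsilon_greedy_probs(action_values, epsilon):
    """Probability of each action under an epsilon-greedy policy."""
    n_actions = len(action_values)
    # Index of the first maximal action value (ties go to the lowest index)
    greedy = max(range(n_actions), key=lambda a: action_values[a])
    # Every action gets epsilon / |A| from the exploratory branch
    probs = [epsilon / n_actions] * n_actions
    # The greedy action also gets the remaining 1 - epsilon
    probs[greedy] += 1.0 - epsilon
    return probs


probs = epsilon_greedy_probs([0.2, 0.7], epsilon=0.1)
print(probs)
```

Every action keeps a nonzero probability for any $\varepsilon > 0$, which is what lets this kind of policy explore without exploring starts.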

## Off-policy Prediction via Importance Sampling

All learning control methods face a dilemma: They seek to learn action values conditional  
on subsequent optimal behavior, but they need to behave non-optimally in order to  
explore all actions (to find the optimal actions).

A more straightforward approach is to use two policies:
 - **target policy** -> one that is learned about and that becomes the optimal policy
 - **behavior policy** -> one that is more exploratory and is used to generate behavior

On-policy methods are generally simpler and are considered first.

Off-policy methods require additional concepts and notation, and because the data is due to a different policy,  
off-policy methods are often of greater variance and are slower to converge.

On the other hand, off-policy methods are more powerful and general.

For example, they can often be applied to learn from data generated by a conventional non-learning controller, or from a human expert.

Off-policy learning is also seen by some as key to learning multi-step predictive models of the world’s dynamics.

Here $\pi$ is the target policy and $b$ is the behavior policy; both policies are considered fixed and given.

It is required that $\pi(a|s) > 0$ implies $b(a|s) > 0$: every action taken under $\pi$ must also be taken, at least occasionally, under $b$.  
This is called the assumption of **coverage**.
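Coverage can be checked directly when both policies are available as tables. A minimal sketch, assuming each policy is a dict that maps a state to a list of action probabilities; the function `has_coverage` and the example policies are hypothetical.

```python
def has_coverage(target_probs, behavior_probs):
    """True if every action with pi(a|s) > 0 also has b(a|s) > 0, in every state."""
    for state, pi_s in target_probs.items():
        b_s = behavior_probs[state]
        for pi_a, b_a in zip(pi_s, b_s):
            if pi_a > 0 and b_a <= 0:
                return False
    return True


# Hypothetical single-state, two-action example
greedy_target = {"s0": [1.0, 0.0]}
uniform_behavior = {"s0": [0.5, 0.5]}
deterministic_behavior = {"s0": [0.0, 1.0]}

print(has_coverage(greedy_target, uniform_behavior))        # True
print(has_coverage(greedy_target, deterministic_behavior))  # False
```

In the second case the behavior policy never takes the action the target policy prefers, so no amount of data from it can say anything about the target policy's return.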

Almost all off-policy methods utilize importance sampling, a general technique for  
estimating expected values under one distribution given samples from another.
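For a trajectory starting at time $t$, the importance-sampling ratio is $\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$. Below is a minimal sketch of how returns are reweighted with it, comparing the ordinary (simple average) and weighted (weighted average) estimators. The function names and the two-episode data are illustrative assumptions.

```python
def importance_ratio(target_probs, behavior_probs):
    """Product over the trajectory of pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for p, q in zip(target_probs, behavior_probs):
        rho *= p / q
    return rho


def ordinary_is(ratios, returns):
    """Ordinary importance sampling: divide by the number of returns."""
    return sum(r * g for r, g in zip(ratios, returns)) / len(returns)


def weighted_is(ratios, returns):
    """Weighted importance sampling: divide by the sum of the ratios."""
    total = sum(ratios)
    if total == 0:
        return 0.0
    return sum(r * g for r, g in zip(ratios, returns)) / total


# Hypothetical data: two two-step episodes generated by a uniform behavior policy.
# The target policy always agrees with episode 1 and disagrees once with episode 2.
rho_1 = importance_ratio([1.0, 1.0], [0.5, 0.5])
rho_2 = importance_ratio([0.0, 1.0], [0.5, 0.5])
observed_returns = [1.0, -1.0]

print(ordinary_is([rho_1, rho_2], observed_returns))  # 2.0
print(weighted_is([rho_1, rho_2], observed_returns))  # 1.0
```

In this example the ordinary estimate (2.0) exceeds the largest return that was observed (1.0), which illustrates its higher variance. The weighted estimate stays within the range of observed returns, at the cost of being biased for finite samples.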

In [None]:
def softmax(state_action_value):
    e_x = np.exp(state_action_value - np.max(state_action_value))
    probs = e_x / e_x.sum(axis=0)
    return probs

class OffPolicyMonteCarloPrediction():

    def __init__(self, action_space, gamma, target_policy, behavior_policy):
        self.gamma  = gamma

        self.target_policy = target_policy
        self.behavior_policy = behavior_policy

        self.state_action_value = collections.defaultdict(lambda: np.zeros((action_space.n)))
        self.state_action_value_behavior = collections.defaultdict(lambda: np.ones((action_space.n)) / action_space.n)

        self.cumulative_weights = collections.defaultdict(lambda: np.zeros((action_space.n)))

        self.actions = []
        self.states = []
        self.rewards = []

    def action(self, state):
        return self.behavior_policy(self.state_action_value_behavior, state)
    
    def observe(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
    
    def optimize(self):
        g = 0
        w = 1

        for t in reversed(range(len(self.states))):
            state, action = self.states[t], self.actions[t]
            g = self.gamma * g + self.rewards[t]

            # Weighted importance sampling: accumulate the weights, then step towards g
            self.cumulative_weights[state][action] += w
            self.state_action_value[state][action] += (w / self.cumulative_weights[state][action]) * (g - self.state_action_value[state][action])

            # Importance ratio pi(a|s) / b(a|s); softmax is applied to the values of this state only
            target_prob = softmax(self.state_action_value[state])[action]
            behavior_prob = softmax(self.state_action_value_behavior[state])[action]
            w = w * (target_prob / behavior_prob)
        
        self.states = []
        self.actions = []
        self.rewards = []

In [None]:
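class OffPolicyMonteCarloControl():
    # Target policy: greedy with respect to Q. Behavior policy: assumed here to be
    # epsilon-greedy with respect to Q, so that b(a|s) is known for the importance weights.

    def __init__(self, action_space, gamma, epsilon=0.1):
        self.gamma  = gamma
        self.epsilon = epsilon
        self.n_actions = action_space.n

        self.state_action_value = collections.defaultdict(lambda: np.zeros((action_space.n)))
        self.cumulative_weights = collections.defaultdict(lambda: np.zeros((action_space.n)))

        self.actions = []
        self.states = []
        self.rewards = []
        self.behavior_probs = []

    def action(self, state):
        # Epsilon-greedy behavior distribution b(.|s)
        probs = np.full(self.n_actions, self.epsilon / self.n_actions)
        probs[np.argmax(self.state_action_value[state])] += 1 - self.epsilon

        action = np.random.choice(self.n_actions, p=probs)
        # Remember b(a|s) as it was when the action was taken
        self.behavior_probs.append(probs[action])
        return action
    
    def observe(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
    
    def optimize(self):
        g = 0
        w = 1

        for t in reversed(range(len(self.states))):
            state, action = self.states[t], self.actions[t]
            g = self.gamma * g + self.rewards[t]

            # Weighted importance sampling update of Q
            self.cumulative_weights[state][action] += w
            self.state_action_value[state][action] += (w / self.cumulative_weights[state][action]) * (g - self.state_action_value[state][action])

            # The greedy target policy gives this action probability 0, so earlier steps get weight 0
            if action != np.argmax(self.state_action_value[state]):
                break

            # pi(a|s) = 1 for the greedy action, so the ratio is 1 / b(a|s)
            w = w / self.behavior_probs[t]
        
        self.states = []
        self.actions = []
        self.rewards = []
        self.behavior_probs = []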