# Monte Carlo Methods

<div>
<img src="images/Drunk.jpg" width="500"/>
</div>

## Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.  We assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense. The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

In the following example we use a Monte Carlo method to estimate pi: 

![Monte Carlo estimate of pi](images/pi_30k.webp)
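
As a rough illustration of how such an estimate can be computed, here is a minimal sketch. It assumes the standard quarter-circle construction (sample points uniformly in the unit square and count those inside the unit circle); the function name `estimate_pi` and its parameters are hypothetical and are not taken from the figure.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from uniformly sampled points in the unit square.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # the point lies inside the quarter circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (square area) = pi / 4
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n))
```

The estimate is itself an average of random samples, so its error shrinks only slowly as the number of samples grows, which is the same behaviour we should expect from the value estimates below.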

## Monte Carlo Prediction

Monte Carlo methods can be used for learning the state-value
function for a *given* policy. The value of a state is the expected
return, that is, the expected cumulative future (discounted) reward, starting
from that state. An obvious way to estimate it from experience is simply to
average the returns observed after visits to that state. As more returns are
observed, the average should converge to the expected value.

![Monte Carlo Prediction](images/mc-pred.png)
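
The averaging idea can be sketched in a few lines. The following is a minimal first-visit variant under stated assumptions: episodes are given as lists of `(state, reward)` pairs, where the reward is the one received after leaving that state. The function name `first_visit_mc_prediction` and this episode format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate state values by averaging first-visit returns.

    episodes: list of episodes, each a list of (state, reward) pairs
              (assumed format, for illustration only).
    gamma:    discount factor.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # time step of the first visit to each state in this episode
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)
        # walk backwards so g is the discounted return from step t onwards
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit[state] == t:
                returns_sum[state] += g
                returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


if __name__ == "__main__":
    toy_episodes = [
        [("A", 1), ("B", 2)],
        [("A", 0), ("A", 3), ("B", 1)],
    ]
    values = first_visit_mc_prediction(toy_episodes)
    print(values["A"], values["B"])  # prints 3.5 1.5
```

In the second toy episode state `A` is visited twice, but only the return from its first visit (0 + 3 + 1 = 4) enters the average.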

## Monte Carlo Control

Monte Carlo estimation can also be used for control, that is, to approximate optimal policies. The approach follows generalized policy iteration (GPI), in which one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function.

We will implement a simple version of policy improvement, which makes the policy greedy with respect to the current value function. Because we work with an action-value function, no model is needed to construct the greedy policy. For any action-value function q, the corresponding greedy policy is the one that, for each s ∈ S, deterministically chooses an action with maximal action-value.

![MCC](images/mcc-alg.png)
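
The greedy improvement step described above can be sketched as follows. This is a minimal illustration, assuming the action-value estimates are stored as a nested dictionary `state -> action -> value`; the name `greedy_policy` and the toy values are hypothetical.

```python
def greedy_policy(q):
    """Return a dict mapping each state to an action with maximal value.

    q: dict mapping state -> dict mapping action -> estimated action value
       (assumed layout, for illustration only).
    Ties are broken in favour of the action that appears first in the dict.
    """
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q.items()
    }


if __name__ == "__main__":
    toy_q = {
        0: {"left": 1.0, "right": 2.5},
        1: {"left": -0.5, "right": -2.0},
    }
    print(greedy_policy(toy_q))  # prints {0: 'right', 1: 'left'}
```

Only states that appear in `q` receive an action, so a policy built this way is undefined for states that were never visited while collecting data.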

### Implementation

In [1]:
presentation = False

In [2]:
import itertools

from collections import deque

import random
import gym
import torch
from torch.utils.data import DataLoader
import numpy as np

from AI_agents.Environments.gym_problem import GymProblem
from AI_agents.Search.best_first_search import a_star

from IL.dataset import ImitationLearningDataset
from IL.evaluation import evaluate_policy
from IL.ipython_vis import animate_policy
from IL.model import MLP
from IL.training import train_torch_classifier_sgd
import AI_agents.Search.utils as utils


# initialize env
env = gym.make("Taxi-v3").env
env.reset()

PASSENGER_IN_TAXI = 4  # passenger idx when in taxi
locs = env.unwrapped.locs  # environment locations

# random seed
seed = 42

We start with a policy that chooses an action uniformly at random at each step.

In [3]:
class TaxiMonteCarloPolicy:
    def __init__(self):
        # not used by this random policy
        self.cur_plan = deque()
    
    def __call__(self, obs):
        # list the actions applicable in the observed state
        # and pick one of them uniformly at random.
        taxi_prob = GymProblem(env, env.unwrapped.s)
        actions = list(taxi_prob.get_applicable_actions(utils.Node(utils.State(obs, False), None, None, 0)))
        chosen_action = random.choice(actions)
        return chosen_action
    
RandomTraveler = TaxiMonteCarloPolicy()

In [4]:
# This code will run forever until it is interrupted
#animate_policy(env, RandomTraveler)

We added a reward component to the trajectory, which is necessary to calculate the best action per state.

In [5]:
# trajectory struct
class Trajectory:
    def __init__(self, observations=None, actions=None, rewards=None):
        self.observations = observations or []
        self.actions = actions or []
        self.rewards = rewards or []
    
    def add_step(self, observation, action, reward):
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)
        
    def __str__(self):
        return 'trajectory: ' + str(list(zip(self.observations, self.actions)))
    
    def __repr__(self):
        return str(self)

In [6]:
def get_trajectory(policy, max_trajectory_length=float('inf')):
    # init trajectory object
    trajectory = Trajectory()
    
    # get first observation
    obs = env.reset()
    
    # init first reward
    reward = 0
    # iterate and step in environment.
    # limit num actions for incomplete policies
    for i in itertools.count(start=1):
        action = policy(obs)
        old_obs = obs
        obs, reward, done, info = env.step(action)
        trajectory.add_step(old_obs, action, reward)
        
        if done or i >= max_trajectory_length:
            break
    
    return trajectory

trajectory = get_trajectory(RandomTraveler, 500)
trajectory

trajectory: [(429, 1), (329, 2), (349, 1), (249, 4), (249, 1), (149, 3), (149, 3), (149, 4), (149, 2), (169, 1), (69, 5), (69, 5), (69, 4), (69, 2), (89, 4), (89, 0), (189, 3), (169, 3), (149, 0), (249, 0), (349, 2), (349, 1), (249, 1), (149, 5), (149, 2), (169, 5), (169, 3), (149, 3), (149, 1), (49, 1), (49, 4), (49, 4), (49, 3), (49, 3), (49, 1), (49, 5), (49, 5), (49, 0), (149, 5), (149, 1), (49, 0), (149, 5), (149, 4), (149, 1), (49, 1), (49, 0), (149, 2), (169, 4), (169, 0), (269, 0), (369, 3), (369, 5), (369, 5), (369, 1), (269, 4), (269, 4), (269, 4), (269, 3), (249, 3), (229, 3), (209, 4), (209, 5), (209, 3), (209, 4), (209, 1), (109, 0), (209, 1), (109, 0), (209, 2), (229, 4), (229, 2), (249, 0), (349, 4), (349, 5), (349, 2), (349, 3), (329, 3), (329, 0), (429, 2), (449, 4), (449, 5), (449, 0), (449, 5), (449, 0), (449, 1), (349, 4), (349, 1), (249, 2), (269, 5), (269, 1), (169, 3), (149, 3), (149, 5), (149, 0), (249, 0), (349, 3), (329, 3), (329, 3), (329, 3), (329, 1), (229,

Collect episodes/trajectories 

In [7]:
def collect_data(policy, num_trajectories, max_trajectory_length=float('inf')):
    trajectories = []
    for _ in range(num_trajectories):
        trajectories.append(get_trajectory(policy, max_trajectory_length))

    return trajectories



For each state, compute the average (undiscounted) return observed after each action, and set the action with the highest average return as the policy for that state.

In [8]:
from collections import defaultdict

def build_decision_dict(raw_data):
    # state -> action -> list of returns observed after taking that action in that state
    state_action_scores = defaultdict(lambda: defaultdict(list))
    for trajectory in raw_data:
        reward_sum = 0
        # walk the episode backwards so that reward_sum is the undiscounted return from each step
        for state, action, reward in reversed(list(zip(trajectory.observations, trajectory.actions, trajectory.rewards))):
            reward_sum += reward
            state_action_scores[state][action].append(reward_sum)
            
    # replace each list of returns by its mean, then keep only the best action per state
    for state, action_values in state_action_scores.items():
        for action, values_list in action_values.items():
            state_action_scores[state][action] = np.mean(values_list)
        state_action_scores[state] = max(state_action_scores[state], key=state_action_scores[state].get)
    return state_action_scores
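
Storing every return in a list is simple, but the same averages can also be maintained incrementally, one episode at a time. The following is a minimal sketch of that running-mean variant, not the implementation used in this notebook. It assumes an episode is a list of `(state, action, reward)` tuples, and the name `update_action_values` is hypothetical.

```python
from collections import defaultdict


def update_action_values(q, counts, episode):
    """Fold one finished episode into running-mean action values.

    q:       dict mapping (state, action) -> current mean return
    counts:  dict mapping (state, action) -> number of returns averaged so far
    episode: list of (state, action, reward) tuples in time order
             (assumed format, for illustration only).
    """
    g = 0.0
    # walk backwards so g is the undiscounted return from each step onwards
    for state, action, reward in reversed(episode):
        g += reward
        key = (state, action)
        counts[key] += 1
        # running mean: new_mean = old_mean + (sample - old_mean) / n
        q[key] += (g - q[key]) / counts[key]


if __name__ == "__main__":
    q = defaultdict(float)
    counts = defaultdict(int)
    update_action_values(q, counts, [(0, "a", -1), (1, "b", 20)])
    update_action_values(q, counts, [(0, "a", -1), (0, "a", -1), (1, "b", 10)])
    print(q[(0, "a")], q[(1, "b")])  # prints 12.0 15.0
```

Like the list-based version, this counts every visit to a state-action pair, and memory use no longer grows with the number of collected episodes.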
    

In [9]:
class MCCPolicy:
    def __init__(self, state_action_map):
        self.state_action_map = state_action_map
    
    def __call__(self, obs):
        # look up the action stored for this state
        return self.state_action_map[obs]

In [10]:
import json

def calc_final_policy(learning_policy, num_trajectories, json_name=None):
    if presentation:
        if json_name is None:
            raise Exception("Can't present without filename")
        with open(json_name + ".json", 'r') as fp:
            state_action_map = json.load(fp)
        policy = MCCPolicy({int(key):value for key, value in state_action_map.items()})
    else:
        raw_data = collect_data(learning_policy, num_trajectories)
        policy = MCCPolicy(build_decision_dict(raw_data))
        if json_name is not None:
            with open(json_name + ".json", 'w') as fp:
                json.dump(policy.state_action_map, fp)
    return policy

In [11]:
# seed the environment so the sequence of start states is repeatable
env.seed(seed)

policy = calc_final_policy(RandomTraveler, 3000, "mcc_3000")

In [12]:
total_reward, mean_reward = evaluate_policy(env, RandomTraveler, num_episodes=10000, seed=seed)
print('Monte Carlo Policy')
print('---------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Policy
---------
total reward over all episodes: -981999
mean reward per episode:        -98.1999


In [13]:
total_reward, mean_reward = evaluate_policy(env, policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Policy
-----------------
total reward over all episodes: -1095707
mean reward per episode:        -109.5707


In [14]:
# This code will run forever until it is interrupted
# animate_policy(env, policy)

In [15]:
# seed the environment so the sequence of start states is repeatable
env.seed(seed)

policy = calc_final_policy(RandomTraveler, 30000, "mcc_30000")

In [16]:
total_reward, mean_reward = evaluate_policy(env, policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Policy
-----------------
total reward over all episodes: -428390
mean reward per episode:        -42.839


In [17]:
# This code will run forever until it is interrupted
# animate_policy(env, policy)

We introduce an improvement to the data-collecting policy:
if an action taken from a certain state leads back to that same state (the action changes nothing),
it is removed from the possible choices.

In [18]:
def get_non_stationary_actions(taxi_prob, obs):
    node = utils.Node(utils.State(obs, False), None, None, 0)
    actions = list(taxi_prob.get_applicable_actions(node))
    applicable_actions = []
    for action in actions:
        if taxi_prob.get_successors(action, node)[0].state.get_key() != obs:
            applicable_actions.append(action)
    return applicable_actions

class TaxiMoneCarloNonStationaryPolicy:
    def __init__(self):
        # not used by this random policy
        self.cur_plan = deque()
    
    def __call__(self, obs):
        # keep only the actions that change the state,
        # then pick one of them uniformly at random.
        taxi_prob = GymProblem(env, env.unwrapped.s)
        actions = get_non_stationary_actions(taxi_prob, obs)
        chosen_action = random.choice(actions)
        return chosen_action
    
nonstationary_policy = TaxiMoneCarloNonStationaryPolicy()

In [19]:
env.seed(seed)
nonstationary_control_policy = calc_final_policy(nonstationary_policy, 3000, "mcc_nonstationary_3000")

In [20]:
total_reward, mean_reward = evaluate_policy(env, nonstationary_control_policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Nonstationary Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Nonstationary Policy
-----------------
total reward over all episodes: -76400
mean reward per episode:        -7.64


In [21]:
# This code will run forever until it is interrupted
# animate_policy(env, nonstationary_control_policy)

In [22]:
env.seed(seed)
nonstationary_control_policy = calc_final_policy(nonstationary_policy, 30000, "mcc_nonstationary_30000")

In [23]:
total_reward, mean_reward = evaluate_policy(env, nonstationary_control_policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Nonstationary Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Nonstationary Policy
-----------------
total reward over all episodes: 49345
mean reward per episode:        4.9345


In [24]:
# This code will run forever until it is interrupted
# animate_policy(env, nonstationary_control_policy)

This idea can be extended to removing "cycles" from the episode that do not result in positive reward, although this requires a more complex implementation.
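
One possible reading of this idea is sketched below, as an untested suggestion rather than part of the notebook's implementation. It assumes an episode is a list of `(state, action, reward)` tuples and treats a "cycle" as any stretch of steps that returns to an earlier state; the stretch is dropped when its rewards sum to zero or less. The name `remove_unrewarding_cycles` is hypothetical.

```python
def remove_unrewarding_cycles(steps):
    """Drop loops that return to an earlier state without positive reward.

    steps: list of (state, action, reward) tuples in time order
           (assumed format, for illustration only).
    """
    kept = []  # steps retained so far
    for step in steps:
        state = step[0]
        # index of the earliest retained step that started from this state
        idx = next((i for i, s in enumerate(kept) if s[0] == state), None)
        if idx is not None:
            cycle_reward = sum(reward for _, _, reward in kept[idx:])
            if cycle_reward <= 0:
                # coming back to `state` gained nothing, so erase the loop
                del kept[idx:]
        kept.append(step)
    return kept


if __name__ == "__main__":
    episode = [
        (0, "east", -1),
        (1, "west", -1),   # goes straight back to state 0
        (0, "east", -1),
        (1, "south", -1),
        (2, "dropoff", 20),
    ]
    print(remove_unrewarding_cycles(episode))
    # prints [(0, 'east', -1), (1, 'south', -1), (2, 'dropoff', 20)]
```

The linear scan for an earlier occurrence makes this quadratic in the episode length, which may matter for long random episodes. Note also that pruning changes the returns that are averaged, so the resulting estimates no longer describe the policy that generated the data.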

## Summary

We discussed using Monte Carlo methods for both prediction/evaluation and improvement of policies.
The idea of Monte Carlo Control is to utilize both of these aspects in unison:

<div>
<img src="images/mcc-cycle.png" width="500"/>
</div>