# Monte Carlo Methods

<div>
<img src="images/Drunk.jpg" width="500"/>
</div>

## Introduction

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.  We assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected. Only on the completion of an episode are value estimates and policies changed. Monte Carlo methods can thus be incremental in an episode-by-episode sense, but not in a step-by-step (online) sense. The term “Monte Carlo” is often used more broadly for any estimation method whose operation involves a significant random component.

In the following example we use a Monte Carlo method to estimate pi: 

![Monte Carlo estimation of pi](images/pi_30k.webp)
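
One common way to set up such an estimate is sketched below. It is an illustration, not necessarily the procedure behind the figure: points are sampled uniformly in the unit square, and the fraction that lands inside the quarter circle of radius 1 approximates pi / 4. The function name `estimate_pi`, the seed, and the sample count are assumptions made for this example.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count the point if it falls inside the quarter circle of radius 1
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (unit-square area) = pi / 4
    return 4.0 * inside / num_samples


print(estimate_pi(30_000))
```

The estimate is itself a sample average, so its error shrinks roughly with the square root of the number of samples.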

## Monte Carlo Prediction

Monte Carlo methods can be used for learning the state-value
function for a *given* policy. The value of a state is the expected
return — expected cumulative future (discounted) reward — starting from that
state. An obvious way to estimate it from experience is simply to
average the returns observed after visits to that state. As more returns are
observed, the average should converge to the expected value.    

![Monte Carlo Prediction](images/mc-pred.png)
[Reinforcement Learning: An Introduction - Sutton and Barto]
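
The averaging idea can be tried on a toy problem. The sketch below is a minimal first-visit Monte Carlo prediction on a small random walk. It is not the Taxi environment used later, and the environment, the function names, and the parameters are all assumptions chosen for illustration. States 0 and 4 are terminal, the walk starts in state 2, and only reaching state 4 is rewarded (+1), so the true undiscounted values of states 1, 2, and 3 are 0.25, 0.5, and 0.75.

```python
import random
from collections import defaultdict


def generate_episode(rng):
    """Random walk on states 0..4 starting at 2; states 0 and 4 are terminal.
    Reaching state 4 yields reward +1, every other transition yields 0."""
    state, episode = 2, []
    while state not in (0, 4):
        next_state = state + rng.choice((-1, 1))
        reward = 1.0 if next_state == 4 else 0.0
        episode.append((state, reward))
        state = next_state
    return episode


def first_visit_mc_prediction(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(rng)
        g = 0.0
        first_visit_return = {}
        # Walk backwards so that g is the return following each step
        for state, reward in reversed(episode):
            g = gamma * g + reward
            # Later overwrites come from earlier time steps, so the value
            # left at the end belongs to the first visit of the state
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in sorted(returns_sum)}


values = first_visit_mc_prediction(20_000)
print(values)
```

With more episodes the printed averages should move closer to the true values, which is the convergence described above.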

## Monte Carlo Control

Monte Carlo estimation can be used in control, that is, to approximate optimal policies. The overall approach follows generalized policy iteration (GPI). In GPI one maintains both an
approximate policy and an approximate value function. The value function is
repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function.

We will implement a simple version of policy improvement, which makes the policy greedy with respect to the current value function. In this case we have an action-value function, and
therefore no model is needed to construct the greedy policy. For any action-value function q, the corresponding greedy policy is the one that, for each s ∈ S, deterministically chooses an action with maximal action-value.

![MCC](images/mcc-alg.png)
[Reinforcement Learning: An Introduction - Sutton and Barto]
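
The greedy step can be written in a few lines. The sketch below assumes the action-value estimates are stored as a nested dictionary (state to action to value). The function name and the example values are hypothetical.

```python
def greedy_policy(q):
    """Map each state to an action with maximal estimated action-value.

    q is a dict of the form: state -> {action: estimated value}.
    """
    return {state: max(action_values, key=action_values.get)
            for state, action_values in q.items()}


# Hypothetical action-value estimates for two states
q = {
    "s0": {"left": -3.0, "right": -1.5},
    "s1": {"left": 0.5, "right": 0.2},
}
print(greedy_policy(q))  # {'s0': 'right', 's1': 'left'}
```

Only the action-values are consulted, which is why no model of the environment is needed for this step.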

### Implementation

In [1]:
## Presentation mode: if True, load saved policy dictionaries; otherwise generate
## new trajectories to build a policy (may take a long time)
presentation = True

In [2]:
!pip install git+https://github.com/sarah-keren/AI_agents

In [3]:
import itertools

import random
import gym
import numpy as np

from AI_agents.Environments.gym_problem import GymProblem

from Utilities.evaluation import evaluate_policy
from Utilities.ipython_vis import animate_policy
import AI_agents.Search.utils as utils


# initialize env
env = gym.make("Taxi-v3").env
env.reset()

PASSENGER_IN_TAXI = 4  # passenger idx when in taxi
locs = env.unwrapped.locs  # environment locations

# random seed
seed = 42

Use a policy that chooses an action randomly at each step in order to generate a trajectory database:

In [4]:
class TaxiMonteCarloPolicy:
    def __call__(self, obs):
        # choose uniformly at random among the actions applicable in the observed state
        taxi_prob = GymProblem(env, env.unwrapped.s)
        actions = list(taxi_prob.get_applicable_actions(utils.Node(utils.State(obs, False), None, None, 0)))
        chosen_action = random.choice(actions)
        return chosen_action
    
RandomTraveler = TaxiMonteCarloPolicy()

Animation of a random trajectory:

In [5]:
# This code will run forever until it is interrupted
# animate_policy(env, RandomTraveler)

We use trajectory and episode interchangeably in this context.
We added a reward component to the trajectory, which is necessary to calculate the best action per state.

Trajectory steps are represented as (observation/state, action, reward)
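
Given steps in this (state, action, reward) form, the return that follows each step can be accumulated by walking the rewards backwards. The sketch below is a standalone illustration. The function name `compute_returns`, the optional discount `gamma`, and the four example steps are assumptions, not values taken from the environment.

```python
def compute_returns(rewards, gamma=1.0):
    """Return the list G_0, G_1, ... where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# Hypothetical short trajectory of (state, action, reward) steps
steps = [(92, 2, -1), (92, 4, -10), (92, 0, -1), (192, 5, 20)]
rewards = [reward for _, _, reward in steps]
print(compute_returns(rewards))  # [8.0, 9.0, 19.0, 20.0]
```

With `gamma=1.0` the return is simply the sum of all rewards from that step to the end of the episode.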

In [6]:
# trajectory struct
class Trajectory:
    def __init__(self, observations=None, actions=None, rewards=None):
        self.observations = observations or []
        self.actions = actions or []
        self.rewards = rewards or []
    
    def add_step(self, observation, action, reward):
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)
        
    def __str__(self):
        return 'trajectory: ' + str(list(zip(self.observations, self.actions, self.rewards)))
    
    def __repr__(self):
        return str(self)

A function for creating one episode/trajectory:

In [7]:
def get_trajectory(policy, max_trajectory_length=float('inf')):
    # init trajectory object
    trajectory = Trajectory()
    
    # get first observation
    obs = env.reset()
    
    # iterate and step in the environment;
    # max_trajectory_length caps episodes under policies that may never terminate
    for i in itertools.count(start=1):
        action = policy(obs)
        # we register the observation with the action that acted upon it
        # and the reward it got, and save the next observation
        old_obs = obs
        obs, reward, done, info = env.step(action)
        trajectory.add_step(old_obs, action, reward)
        
        if done or i >= max_trajectory_length:
            break
    
    return trajectory

trajectory = get_trajectory(RandomTraveler, 500)
trajectory

trajectory: [(92, 2, -1), (92, 4, -10), (92, 5, -10), (92, 0, -1), (192, 5, -10), (192, 3, -1), (172, 3, -1), (152, 0, -1), (252, 2, -1), (272, 4, -10), (272, 4, -10), (272, 5, -10), (272, 5, -10), (272, 1, -1), (172, 3, -1), (152, 2, -1), (172, 1, -1), (72, 4, -10), (72, 2, -1), (92, 0, -1), (192, 0, -1), (292, 2, -1), (292, 2, -1), (292, 4, -10), (292, 2, -1), (292, 2, -1), (292, 2, -1), (292, 0, -1), (392, 5, -10), (392, 1, -1), (292, 1, -1), (192, 0, -1), (292, 0, -1), (392, 3, -1), (372, 0, -1), (472, 3, -1), (472, 4, -1), (476, 1, -1), (376, 2, -1), (396, 4, -10), (396, 5, -10), (396, 5, -10), (396, 4, -10), (396, 5, -10), (396, 0, -1), (496, 4, -10), (496, 5, -10), (496, 1, -1), (396, 1, -1), (296, 1, -1), (196, 4, -10), (196, 2, -1), (196, 4, -10), (196, 5, -10), (196, 4, -10), (196, 3, -1), (176, 2, -1), (196, 5, -10), (196, 0, -1), (296, 2, -1), (296, 5, -10), (296, 5, -10), (296, 2, -1), (296, 0, -1), (396, 5, -10), (396, 0, -1), (496, 5, -10), (496, 4, -10), (496, 1, -1), (

Collect episodes/trajectories:

In [8]:
def collect_data(policy, num_trajectories, max_trajectory_length=float('inf')):
    trajectories = []
    for _ in range(num_trajectories):
        trajectories.append(get_trajectory(policy, max_trajectory_length))

    return trajectories



For each state: calculate the action with the max average return and set that action as the policy for that state.

In [9]:
from collections import defaultdict

def build_decision_dict(raw_data):
    # Nested dictionary: state -> action -> list of observed returns
    state_action_returns = defaultdict(lambda: defaultdict(list))
    for trajectory in raw_data:
        reward_sum = 0
        # iterate backwards to accumulate the (undiscounted) return G that
        # followed each observed state-action pair; every visit is recorded
        steps = zip(trajectory.observations, trajectory.actions, trajectory.rewards)
        for state, action, reward in reversed(list(steps)):
            reward_sum += reward
            state_action_returns[state][action].append(reward_sum)

    # For each state choose the action with the highest mean return.
    # A plain dict is returned so that unseen states raise a KeyError
    # instead of silently yielding an empty defaultdict as the "action".
    state_action_map = {}
    for state, action_returns in state_action_returns.items():
        mean_returns = {action: np.mean(returns) for action, returns in action_returns.items()}
        state_action_map[state] = max(mean_returns, key=mean_returns.get)
    return state_action_map
    
class MCCPolicy:
    def __init__(self, state_action_map):
        self.state_action_map = state_action_map
    
    def __call__(self, obs):
        # look up the greedy action learned for this observation
        return self.state_action_map[obs]

Construct a learned policy from the state-action map we get from calculating mean returns with the training policy.

In [10]:
import json

def calc_final_policy(learning_policy, num_trajectories, json_name=None):
    # If the presentation flag is on, the state-action map is loaded from the file
    # json_name + ".json" (which must exist); otherwise num_trajectories new
    # trajectories are generated to train the policy
    if presentation:
        if json_name is None:
            raise Exception("Can't present without filename")
        with open(json_name + ".json", 'r') as fp:
            state_action_map = json.load(fp)
        policy = MCCPolicy({int(key):value for key, value in state_action_map.items()})
    else:
        raw_data = collect_data(learning_policy, num_trajectories)
        policy = MCCPolicy(build_decision_dict(raw_data))
        if json_name is not None:
            with open(json_name + ".json", 'w') as fp:
                json.dump(policy.state_action_map, fp)
    return policy

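Generate a policy from 3000 episodes and evaluate its performance, first for the random policy and then for the learned one. We can see that the learned policy is not very good; its mean reward per episode is no better than that of the random policy: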

In [11]:
# get the same trajectories every time!
env.seed(seed)

policy = calc_final_policy(RandomTraveler, 3000, "mcc_3000")

In [12]:
total_reward, mean_reward = evaluate_policy(env, RandomTraveler, num_episodes=10000, seed=seed)
print('Monte Carlo Policy')
print('---------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Policy
---------
total reward over all episodes: -983168
mean reward per episode:        -98.3168


In [13]:
total_reward, mean_reward = evaluate_policy(env, policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Policy
-----------------
total reward over all episodes: -1095707
mean reward per episode:        -109.5707


In [14]:
# This code will run forever until it is interrupted
# animate_policy(env, policy)

Generate a policy with 30000 episodes. We can see it achieves better results.

In [15]:
# get the same trajectories every time!
env.seed(seed)

policy = calc_final_policy(RandomTraveler, 30000, "mcc_30000")

In [16]:
total_reward, mean_reward = evaluate_policy(env, policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Policy
-----------------
total reward over all episodes: -428390
mean reward per episode:        -42.839


In [17]:
# This code will run forever until it is interrupted
#animate_policy(env, policy)

### Non-Stationary improvement

We noticed that in the previous results the agent tends to get stuck performing dead-end actions, i.e., actions that leave it in the same state. Here "non-stationary" refers to actions that change the state, not to a non-stationary environment.

We introduce an improvement to the algorithm:
If an action taken from a certain state leads back to that same state (the action changes nothing), it is removed from the possible choices.
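
The filtering rule can be illustrated independently of the Taxi environment. The sketch below uses a small, hypothetical deterministic transition table; the state names, action names, and function names are all assumptions for this example.

```python
import random

# Hypothetical deterministic transition table: (state, action) -> next state
TRANSITIONS = {
    ("A", "north"): "A",   # bumps into a wall, state unchanged
    ("A", "east"): "B",
    ("A", "pickup"): "A",  # nothing to pick up here, state unchanged
    ("B", "west"): "A",
    ("B", "east"): "B",    # bumps into a wall, state unchanged
}


def state_changing_actions(state):
    """Return the actions that move the agent to a different state."""
    return [action for (s, action), next_state in TRANSITIONS.items()
            if s == state and next_state != state]


def random_state_changing_policy(state, rng=random):
    # Raises IndexError if no action changes the state
    return rng.choice(state_changing_actions(state))


print(state_changing_actions("A"))  # ['east']
print(state_changing_actions("B"))  # ['west']
```

A state in which no action changes anything would leave the policy with an empty list, so a real implementation may need a fallback for that case.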

In [18]:
def get_non_stationary_actions(taxi_prob, obs):
    node = utils.Node(utils.State(obs, False), None, None, 0)
    actions = list(taxi_prob.get_applicable_actions(node))
    applicable_actions = []
    # if an action results in staying in the same state, i.e., it is stationary,
    # we remove that action.
    for action in actions:
        if taxi_prob.get_successors(action, node)[0].state.get_key() != obs:
            applicable_actions.append(action)
    return applicable_actions

class TaxiMonteCarloNonStationaryPolicy:
    
    def __call__(self, obs):
        # choose uniformly at random among the applicable actions that change the state
        taxi_prob = GymProblem(env, env.unwrapped.s)
        actions = get_non_stationary_actions(taxi_prob, obs)
        chosen_action = random.choice(actions)
        return chosen_action
    
nonstationary_policy = TaxiMonteCarloNonStationaryPolicy()

Generate a policy with 3000 episodes. It is already much better than the original run.

In [19]:
env.seed(seed)
nonstationary_control_policy = calc_final_policy(nonstationary_policy, 3000, "mcc_nonstationary_3000")

In [20]:
total_reward, mean_reward = evaluate_policy(env, nonstationary_control_policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Nonstationary Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Nonstationary Policy
-----------------
total reward over all episodes: -76400
mean reward per episode:        -7.64


In [21]:
# This code will run forever until it is interrupted
#animate_policy(env, nonstationary_control_policy)

Generate a policy with 30000 trajectories.

In [22]:
env.seed(seed)
nonstationary_control_policy = calc_final_policy(nonstationary_policy, 30000, "mcc_nonstationary_30000")

In [23]:
total_reward, mean_reward = evaluate_policy(env, nonstationary_control_policy, num_episodes=10000, seed=seed)
print('Monte Carlo Control Nonstationary Policy')
print('-----------------')
print(f'total reward over all episodes: {total_reward}')
print(f'mean reward per episode:        {mean_reward}')

Monte Carlo Control Nonstationary Policy
-----------------
total reward over all episodes: 49345
mean reward per episode:        4.9345


In [24]:
# This code will run forever until it is interrupted
# animate_policy(env, nonstationary_control_policy)

This idea can be extended to removing "cycles" from the episode that do not yield a positive reward; this requires a more complex implementation.
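
One possible shape for such a cycle filter is sketched below. It is an assumption about how the idea could be realized, not an implementation used in this notebook. It post-processes a recorded episode given as a list of (state, action, reward) steps: whenever a state is revisited and the steps in between earned no positive total reward, that loop is cut out. The function name and the example episode are hypothetical.

```python
def remove_unrewarding_cycles(steps):
    """Cut loops that return to an already-visited state without positive reward.

    steps: list of (state, action, reward) tuples. Returns a new list.
    """
    kept = []
    for step in steps:
        state = step[0]
        # Look for the most recent earlier step taken from the same state
        for j in range(len(kept) - 1, -1, -1):
            if kept[j][0] == state:
                cycle_reward = sum(reward for _, _, reward in kept[j:])
                if cycle_reward <= 0:
                    # The loop kept[j:] led back to `state` for no gain: drop it
                    del kept[j:]
                break
        kept.append(step)
    return kept


# Hypothetical episode that wanders 0 -> 1 -> 1 -> 0 -> 1 -> 2
episode = [(0, "e", -1), (1, "n", -1), (1, "w", -1),
           (0, "e", -1), (1, "s", -1), (2, "drop", 20)]
print(remove_unrewarding_cycles(episode))
# [(0, 'e', -1), (1, 's', -1), (2, 'drop', 20)]
```

The quadratic scan keeps the sketch short. Splicing the episode at a repeated state relies on the state capturing everything relevant about the environment, i.e., on the Markov property.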

## Summary

We discussed using Monte Carlo methods for both prediction/evaluation and improvement of policies.
The idea of Monte Carlo Control is to utilize both of these aspects in unison:

<div>
<img src="images/mcc-cycle.png" width="400"/>
</div>
[Reinforcement Learning: An Introduction - Sutton and Barto]