# Reinforcement Learning - Monte Carlo
If you want to test/submit your solution **restart the kernel, run all cells and submit the mc_autograde.py file into codegrade.**

In [1]:
# This cell registers the %%execwritefile magic (it executes a cell and writes its contents to a file).
# All cells that start with %%execwritefile should end up in mc_autograde.py after running all cells.
from custommagics import CustomMagics
get_ipython().register_magics(CustomMagics)

In [2]:
%%execwritefile mc_autograde.py
import numpy as np
from collections import defaultdict
from tqdm import tqdm as _tqdm

def tqdm(*args, **kwargs):
    return _tqdm(*args, **kwargs, mininterval=1)  # Safety, do not overflow buffer

Overwriting mc_autograde.py


In [3]:
import matplotlib.pyplot as plt
import sys


%matplotlib inline

assert sys.version_info[:3] >= (3, 6, 0), "Make sure you have Python 3.6 or newer installed!"

## 1. Monte Carlo Prediction

For the Monte Carlo Prediction we will look at the Blackjack game (Example 5.1 from the book), for which the `BlackjackEnv` is implemented in `blackjack.py`. Note that compared to the gridworld, the state is no longer a single integer, which is why we use a dictionary to represent the value function instead of a numpy array. By using `defaultdict`, each state gets a default value of 0.
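
As a small illustration of this representation, a `defaultdict(float)` can be indexed by state tuples directly. The sketch below assumes the tuple layout (player sum, dealer's showing card, usable ace) from Example 5.1, and the stored value is made up rather than taken from the environment.

```python
from collections import defaultdict

# Value function keyed by state tuples; unseen states default to 0.0.
V = defaultdict(float)

# Illustrative value only; not an estimate produced by the environment.
V[(20, 8, False)] = 0.75

seen = V[(20, 8, False)]    # stored value
unseen = V[(13, 2, True)]   # never written, so the default 0.0 is returned

print(seen, unseen)  # 0.75 0.0
print(len(V))        # 2, because reading an unseen key also inserts it
```

Note the side effect in the last line: looking up a missing key adds it to the dictionary. Use `V.get(state, 0.0)` when a read should not create an entry.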

In [4]:
from blackjack import BlackjackEnv
env = BlackjackEnv()

For the Monte Carlo algorithm, we no longer have transition probabilities and we need to *interact* with the environment. This means that we start an episode by using `env.reset` and send the environment actions via `env.step` to observe the reward and next observation (state).
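
The interaction protocol can be sketched without the Blackjack environment. `CoinFlipEnv` below is a hypothetical toy environment written only for this illustration; it mimics the `reset` / `step` interface and the 4-tuple `(observation, reward, done, info)` that `step` is assumed to return here.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment, only here to show the reset/step protocol."""

    def reset(self):
        self.steps_left = 3
        return self.steps_left  # initial observation

    def step(self, action):
        # Reward 1 if the action matches a coin flip, otherwise 0.
        reward = 1 if action == random.randint(0, 1) else 0
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.steps_left, reward, done, {}  # (observation, reward, done, info)


toy_env = CoinFlipEnv()
state = toy_env.reset()
done = False
total_reward = 0
num_steps = 0
while not done:
    action = random.randint(0, 1)  # stand-in for a policy
    state, reward, done, info = toy_env.step(action)
    total_reward += reward
    num_steps += 1

print(num_steps, total_reward)
```

The loop never looks inside the environment: everything the agent learns comes from the observations and rewards returned by `step`.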

In [5]:
# So let's have a look at what we can do in general with an environment...
import gym
# ?gym.Env

In [6]:
# We can also look at the documentation/implementation of a method
# ?env.step

In [7]:
# ??BlackjackEnv

A very simple policy for Blackjack is to *stick* if we have 20 or 21 points and *hit* otherwise. We want to know how good this policy is. This policy is *deterministic*, so it is a function that maps an observation to a single action. Technically, we can implement it as a dictionary, a function, or a class with a method; we use the last option. Moreover, it is often useful (as you will see later) to implement a function that returns the probability $\pi(a|s)$ for a state-action pair (the probability that this policy takes a given action in a given state). We group these two functions in a policy class. To get started, let's implement this simple policy for Blackjack.
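
Written out, a deterministic policy assigns probability 1 to exactly one action in every state. A sketch of $\pi(a|s)$ for this policy, writing $p(s)$ for the player's current sum (a symbol introduced here only for illustration):

$$
\pi(a|s) =
\begin{cases}
1 & \text{if } p(s) \in \{20, 21\} \text{ and } a = \text{stick}, \\
1 & \text{if } p(s) \notin \{20, 21\} \text{ and } a = \text{hit}, \\
0 & \text{otherwise.}
\end{cases}
$$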

In [8]:
%%execwritefile -a mc_autograde.py

class SimpleBlackjackPolicy(object):
    """
    A simple BlackJack policy that sticks with 20 or 21 points and hits otherwise.
    """
    def get_probs(self, states, actions):
        """
        This method takes a list of states and a list of actions and returns a numpy array that contains the
        probability of performing the action in the given state for every corresponding state-action pair.

        Args:
            states: a list of states.
            actions: a list of actions.

        Returns:
            Numpy array filled with probabilities (same length as states and actions)
        """
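        probs = list()
        
        for i in range(len(states)):
            # states[i][0] is the player's current sum; action 0 is stick, action 1 is hit.
            if states[i][0] in (20, 21):
                # Stick with probability 1: the entry is 1 for action 0 and 0 for action 1.
                probs.append(not actions[i])
            else:
                # Hit with probability 1: the entry is 0 for action 0 and 1 for action 1.
                probs.append(actions[i])
        
        return np.array(probs, dtype=int)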
    
    def sample_action(self, state):
        """
        This method takes a state as input and returns an action sampled from this policy.  

        Args:
            state: current state

        Returns:
            An action (int).
        """
        probs = self.get_probs([state, state], [0, 1])
        return np.argmax(probs)


Appending to mc_autograde.py


In [9]:
# Let's check if it makes sense
env = BlackjackEnv()
s = env.reset()
policy = SimpleBlackjackPolicy()
print("State: {}\nSampled Action: {}\nProbabilities [stick, hit]: {}".format(s, policy.sample_action(s), policy.get_probs([s,s],[0,1])))

State: (20, 8, False)
Sampled Action: 0
Probabilities [stick, hit]: [1 0]


Since several algorithms require data from a single episode (or from multiple episodes), it is useful to write a routine that samples a single episode. This will save us some time later. Implement a *sample_episode* function that uses the environment and the policy to sample a single episode.

In [10]:
%%execwritefile -a mc_autograde.py

def sample_episode(env, policy):
    """
    A sampling routine. Given an environment and a policy, samples one episode and returns the states, actions,
    rewards and dones obtained from the environment's step function and the policy's sample_action function as lists.

    Args:
        env: OpenAI gym environment.
        policy: A policy which allows us to sample actions with its sample_action method.

    Returns:
        Tuple of lists (states, actions, rewards, dones). All lists should have same length. 
        Hint: Do not include the state after the termination in the list of states.
    """
    states = []
    actions = []
    rewards = []
    dones = []
    
    s = env.reset()
    done = False
    while not done:
        # Record the state the action is taken in; the state after termination is never appended.
        states.append(s)
        
        a = policy.sample_action(s)
        s, r, done, _ = env.step(a)
        
        actions.append(a)
        rewards.append(r)
        dones.append(done)
    
    return states, actions, rewards, dones

Appending to mc_autograde.py


In [11]:
def custom_sample(env, policy):
    """
    A stub sampling routine for testing. Ignores the environment and the policy and always returns the same
    hand-written one-step episode, in the same format as sample_episode.

    Args:
        env: OpenAI gym environment (unused).
        policy: A policy (unused).

    Returns:
        Tuple of lists (states, actions, rewards, dones), each of length 1.
    """
    return [(16, 10, False)], [1], [0], [True]

In [12]:
# Let's sample some episodes
env = BlackjackEnv()

policy = SimpleBlackjackPolicy()
for episode in range(20):
    trajectory_data = sample_episode(env, policy)
    print("Episode {}:\nStates {}\nActions {}\nRewards {}\nDones {}\n".format(episode,*trajectory_data))

Episode 0:
States [(14, 4, False), (20, 4, False)]
Actions [1, 0]
Rewards [0, -1]
Dones [False, True]

Episode 1:
States [(20, 8, False)]
Actions [0]
Rewards [1]
Dones [True]

Episode 2:
States [(20, 6, False)]
Actions [0]
Rewards [1]
Dones [True]

Episode 3:
States [(18, 10, False)]
Actions [1]
Rewards [-1]
Dones [True]

Episode 4:
States [(12, 10, False)]
Actions [1]
Rewards [-1]
Dones [True]

Episode 5:
States [(12, 8, False), (17, 8, False)]
Actions [1, 1]
Rewards [0, -1]
Dones [False, True]

Episode 6:
States [(21, 10, True)]
Actions [0]
Rewards [1]
Dones [True]

Episode 7:
States [(14, 3, False)]
Actions [1]
Rewards [-1]
Dones [True]

Episode 8:
States [(12, 1, False), (19, 1, False), (21, 1, False)]
Actions [1, 1, 0]
Rewards [0, 0, 1]
Dones [False, False, True]

Episode 9:
States [(13, 9, False), (19, 9, False)]
Actions [1, 1]
Rewards [0, -1]
Dones [False, True]

Episode 10:
States [(19, 7, True), (12, 7, False), (15, 7, False), (17, 7, False), (20, 7, False)]
Actions [1, 1, 1, 

Now implement the MC prediction algorithm (either first visit or every visit). Hint: you can use `for i in tqdm(range(num_episodes))` to show a progress bar. Use the sampling function from above to sample data from a single episode.
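
Two ingredients of the algorithm can be checked in isolation: computing the return $G_t$ backwards through an episode, and updating a running mean without storing all returns. The helper name `discounted_returns` and all numbers below are chosen for illustration only.

```python
def discounted_returns(rewards, discount_factor=1.0):
    """Return G_t for every time step, computed backwards from the end of the episode."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = discount_factor * G + rewards[t]
        returns[t] = G
    return returns


# Rewards shaped like a three-step winning episode; 0.5 is an arbitrary discount.
returns = discounted_returns([0, 0, 1], discount_factor=0.5)
print(returns)  # [0.25, 0.5, 1.0]

# Incremental mean: after n returns, V equals their plain average.
V, n = 0.0, 0
for G in [1.0, -1.0, 1.0, 1.0]:
    n += 1
    V += (G - V) / n
print(V)  # 0.5
```

In Blackjack a state cannot recur within one episode, so first-visit and every-visit Monte Carlo give the same estimates for this environment.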

In [13]:
%%execwritefile -a mc_autograde.py

def mc_prediction(env, policy, num_episodes, discount_factor=1.0, sampling_function=sample_episode):
    """
    Monte Carlo prediction algorithm. Calculates the value function
    for a given policy using sampling.
    
    Args:
        env: OpenAI gym environment.
        policy: A policy which allows us to sample actions with its sample_action method.
        num_episodes: Number of episodes to sample.
        discount_factor: Gamma discount factor.
        sampling_function: Function that generates data from one episode.
    
    Returns:
        A dictionary that maps from state -> value.
        The state is a tuple and the value is a float.
    """

    # Keeps track of current V and count of returns for each state
    # to calculate an update.
    V = defaultdict(float)
    returns_count = defaultdict(float)
    
    for e in tqdm(range(num_episodes)):
        states, actions, rewards, dones = sampling_function(env, policy)

        G = 0
        for i in range(len(states))[::-1]:
            s = states[i]
            r = rewards[i]
            G = discount_factor * G + r
            
            returns_count[s] += 1
            V[s] += (G - V[s]) / returns_count[s]
            
    return V

Appending to mc_autograde.py


In [14]:
import pprint
V_10k = mc_prediction(env, SimpleBlackjackPolicy(), num_episodes=10000)
pprint.pprint(V_10k)

100%|██████████| 10000/10000 [00:00<00:00, 11677.77it/s]

defaultdict(<class 'float'>,
            {(12, 1, False): -0.5517241379310346,
             (12, 1, True): -0.125,
             (12, 2, False): -0.5887850467289719,
             (12, 2, True): -0.24999999999999997,
             (12, 3, False): -0.4431818181818183,
             (12, 3, True): 0.0,
             (12, 4, False): -0.5164835164835165,
             (12, 4, True): -0.8,
             (12, 5, False): -0.48235294117647065,
             (12, 5, True): -0.2,
             (12, 6, False): -0.568181818181818,
             (12, 6, True): -1.0,
             (12, 7, False): -0.4712643678160919,
             (12, 7, True): -0.3333333333333333,
             (12, 8, False): -0.5833333333333336,
             (12, 8, True): 1.0,
             (12, 9, False): -0.5543478260869564,
             (12, 9, True): 0.0,
             (12, 10, False): -0.6000000000000002,
             (12, 10, True): -0.7333333333333334,
             (13, 1, False): -0.6363636363636361,
             (13, 1, True): -0.249




Now make *4 plots* like Figure 5.1 in the book. You can either make 3D plots or heatmaps. Make sure that your results look similar to the results in the book. Give your plots appropriate titles, axis labels, etc.
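
Before plotting, the value dictionary has to be arranged on a regular grid. A minimal sketch, assuming states of the form (player sum, dealer card, usable ace); the helper name `value_grid` and the example values are made up for illustration.

```python
def value_grid(V, usable_ace, player_sums=range(12, 22), dealer_cards=range(1, 11)):
    """Arrange V into rows (player sum) and columns (dealer card) for one usable-ace setting.

    States missing from V are filled with NaN so they show up as gaps instead of zeros.
    """
    return [
        [V.get((player, dealer, usable_ace), float("nan")) for dealer in dealer_cards]
        for player in player_sums
    ]


# Made-up values for illustration only.
V_example = {(12, 1, False): -0.5, (21, 10, False): 0.9, (21, 10, True): 0.95}

grid = value_grid(V_example, usable_ace=False)
print(len(grid), len(grid[0]))  # 10 10
print(grid[0][0], grid[9][9])   # -0.5 0.9
```

A grid like this can be passed to `plt.imshow` for a heatmap. One grid per combination of episode count and usable-ace flag gives the four panels. `V.get` is used instead of `V[...]` so that a lookup does not add missing states to a `defaultdict`.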

In [15]:
%%time
# Let's run your code one time
# V_10k  = mc_prediction(env, SimpleBlackjackPolicy(), num_episodes=10000)
# V_500k = mc_prediction(env, SimpleBlackjackPolicy(), num_episodes=500000)

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 11.4 µs


In [16]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import

import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter



In [17]:
V_aces = [(y, x, v) for (y, x, A), v in V_10k.items() if A]
V_nota = [(y, x, v) for (y, x, A), v in V_10k.items() if not A]


pprint.pprint(V_aces)

print(np.array(list(zip(*V_aces))))

# Grid axes matching Figure 5.1: player sums 12..21 and dealer cards 1 (ace)..10.
X = np.arange(12, 22)
Y = np.arange(1, 11)
X, Y = np.meshgrid(X, Y)


[(14, 7, 0.27272727272727276),
 (19, 10, -0.5833333333333333),
 (21, 10, 0.9064327485380116),
 (16, 4, -0.7777777777777778),
 (17, 7, 0.0666666666666667),
 (15, 7, -0.33333333333333337),
 (20, 10, 0.4415584415584417),
 (21, 5, 0.838709677419355),
 (20, 7, 0.894736842105263),
 (20, 6, 0.4444444444444444),
 (16, 6, -0.38888888888888884),
 (21, 4, 0.8999999999999999),
 (15, 10, -0.5675675675675673),
 (16, 5, -0.23529411764705882),
 (18, 10, -0.42372881355932207),
 (21, 9, 0.9555555555555556),
 (15, 9, -0.8333333333333334),
 (14, 10, -0.25581395348837205),
 (13, 10, -0.19512195121951223),
 (18, 2, -0.1818181818181818),
 (21, 6, 0.9259259259259259),
 (14, 1, -0.49999999999999994),
 (21, 2, 0.8421052631578945),
 (16, 10, -0.4181818181818182),
 (12, 10, -0.7333333333333334),
 (12, 7, -0.3333333333333333),
 (20, 8, 0.8421052631578946),
 (19, 8, -0.15384615384615385),
 (17, 8, -0.0909090909090909),
 (21, 1, 0.653846153846154),
 (18, 5, -0.44999999999999996),
 (15, 5, -0.3333333333333333),
 (19,

## 2. Off-policy Monte Carlo prediction
In the real world, it is often beneficial to learn from the experience of others in addition to your own. For example, you can probably infer that driving a car off a cliff is a bad idea by considering the "return" received by people who have tried it.

Similarly, we can benefit from the experience of other agents in reinforcement learning. In this exercise we will use off-policy Monte Carlo to estimate the value function of our target policy using experience generated by a different behavior policy. Our target policy will be the simple policy defined above (stick if we have *20* or *21* points), and our behavior policy will be a random policy that chooses to stick or hit with 50% probability each. As a first step, implement a random Blackjack policy.

In [18]:
%%execwritefile -a mc_autograde.py

class RandomBlackjackPolicy(object):
    """
    A random BlackJack policy.
    """
    def get_probs(self, states, actions):
        """
        This method takes a list of states and a list of actions and returns a numpy array that contains 
        the probability of performing the action in the given state for every corresponding state-action pair. 

        Args:
            states: a list of states.
            actions: a list of actions.

        Returns:
            Numpy array filled with probabilities (same length as states and actions)
        """
        
        return np.array([0.5] * len(states))
    
    def sample_action(self, state):
        """
        This method takes a state as input and returns an action sampled from this policy.  

        Args:
            state: current state

        Returns:
            An action (int).
        """
        
        probs = self.get_probs([state, state], [0, 1])
        
        actions = np.random.choice(2, 1, p=probs)
        
        return actions[0]
            
        

Appending to mc_autograde.py


In [19]:
# Let's check if it makes sense
env = BlackjackEnv()
s = env.reset()
policy = RandomBlackjackPolicy()
print("State: {}\nSampled Action: {}\nProbabilities [stick, hit]: {}".format(s, policy.sample_action(s), policy.get_probs([s,s],[0,1])))

State: (13, 4, False)
Sampled Action: 0
Probabilities [stick, hit]: [0.5 0.5]


Now implement the MC prediction algorithm with ordinary importance sampling. Use the sampling function from above to sample data from a single episode.
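
For reference, the ordinary importance sampling estimator can be written as follows. The notation is assumed from the book: $\pi$ is the target policy, $b$ the behavior policy, $\mathcal{T}(s)$ the set of time steps at which $s$ is visited, and $T(t)$ the termination time of the episode containing step $t$.

$$
\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)},
\qquad
V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} \, G_t}{|\mathcal{T}(s)|}
$$

Each return is scaled by the probability ratio of the actions taken from step $t$ onward, and the scaled returns are averaged with a plain (unweighted) mean.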

Hint: The `get_probs` methods of the policies may be handy. You can use `for i in tqdm(range(num_episodes))` to show a progress bar.

In [20]:
%%execwritefile -a mc_autograde.py

def mc_importance_sampling(env, behavior_policy, target_policy, num_episodes, discount_factor=1.0,
                           sampling_function=sample_episode):
    """
    Monte Carlo prediction algorithm. Calculates the value function
    for a given target policy using behavior policy and ordinary importance sampling.
    
    Args:
        env: OpenAI gym environment.
        behavior_policy: A policy used to collect the data.
        target_policy: The policy whose value function we want to estimate.
        num_episodes: Number of episodes to sample.
        discount_factor: Gamma discount factor.
        sampling_function: Function that generates data from one episode.
    
    Returns:
        A dictionary that maps from state -> value.
        The state is a tuple and the value is a float.
    """

    # Keeps track of current V and count of returns for each state
    # to calculate an update.
    V = defaultdict(float)
    returns_count = defaultdict(float)
    
    for e in tqdm(range(num_episodes)):
        states, actions, rewards, dones = sampling_function(env, behavior_policy)
        
        G = 0
        W = 1
        for i in range(len(states))[::-1]:
            s = states[i]
            r = rewards[i]
            a = actions[i]
            
            G = discount_factor * G + r
            
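            returns_count[s] += 1
            # Extend the importance weight by this step's ratio pi(a|s) / b(a|s).
            W *= target_policy.get_probs([s, s], [0, 1])[a] / behavior_policy.get_probs([s, s], [0, 1])[a]
            # Incremental mean of the weighted returns W * G (ordinary importance sampling).
            V[s] += 1 / returns_count[s] * (W * G - V[s])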
                        
            
    return V

Appending to mc_autograde.py


In [21]:
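# Sanity check on the fixed one-step episode from custom_sample (a single episode, despite the variable name).
Vi_10k = mc_importance_sampling(env, RandomBlackjackPolicy(), SimpleBlackjackPolicy(), num_episodes=1, sampling_function=custom_sample)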
pprint.pprint(Vi_10k)

100%|██████████| 1/1 [00:00<00:00, 1215.04it/s]

defaultdict(<class 'float'>, {(16, 10, False): 0.0})
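
The value 0.0 can be reproduced by hand for the single step returned by `custom_sample`. The helper `importance_weight` is a hypothetical name used only in this sketch; the probabilities follow from the two policies defined above.

```python
def importance_weight(target_probs, behavior_probs):
    """Product of per-step ratios pi(a|s) / b(a|s) over the steps of one episode."""
    weight = 1.0
    for p_target, p_behavior in zip(target_probs, behavior_probs):
        weight *= p_target / p_behavior
    return weight


# The single step from custom_sample: state (16, 10, False), action 1 (hit).
# The simple policy hits below 20, so pi(hit|s) = 1; the random policy has b(hit|s) = 0.5.
W = importance_weight([1.0], [0.5])
G = 0  # the episode's only reward
print(W, W * G)  # 2.0 0.0

# With a hypothetical reward of -1 instead, the weighted return would be -2.0.
print(W * -1)  # -2.0
```

Because the reward in this episode is 0, the check cannot tell a correct weight from a wrong one. A stub episode with a nonzero reward would make the effect of the weight visible.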





In [22]:
%%time
# Let's run your code one time
# V_10k = mc_importance_sampling(env, RandomBlackjackPolicy(), SimpleBlackjackPolicy(), num_episodes=10000)
# V_500k = mc_importance_sampling(env, RandomBlackjackPolicy(), SimpleBlackjackPolicy(), num_episodes=500000)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.34 µs


Plot the V function. Do the plots look like what you expected?

In [23]:
np.random.seed(100)

# np.random.choice(2, 1, [0.5, 0.5])
np.random.uniform() > 0.5

True

If you want to test/submit your solution **restart the kernel, run all cells and submit the mc_autograde.py file into codegrade.**