# Introduction To Monte Carlo Reinforcement Learning

## CSCI E-82A

## Stephen Elston

Starting with this lesson, we will turn our attention to **reinforcement learning**. Reinforcement learning is a distinctive type of machine learning, different from both supervised learning and unsupervised learning. 

Reinforcement learning has several distinctive characteristics, which differentiate it from other machine learning methods and from dynamic programming:
- **No Markov model** is required for reinforcement learning, in contrast to dynamic programming.
- Like dynamic programming, reinforcement learning **optimizes a reward function**. This is in contrast to supervised and unsupervised learning which use an error or objective function.  
- Reinforcement learning algorithms learn by **experience**. Over time, the algorithm learns a model of the environment and these results are used to optimize the expected reward. Learning from experience is in contrast to supervised learning, which uses known labeled cases. 

The interaction between a reinforcement learning agent and the environment is illustrated in the figure below. Notice that the only feedback the agent receives from the environment is the reward. There is no other evidence.   

<img src="img/RL_Agent_Model.JPG" alt="Drawing" style="width:500px; height:300px"/>
<center> **Reinforcement Learning Agent and Environment** </center>  


**Suggested readings** for Monte Carlo reinforcement learning: Chapter 5 of Sutton and Barto provides a good introduction, including many alternative algorithms not discussed here.   

## Overview of Monte Carlo Reinforcement Learning

A wide variety of reinforcement learning algorithms have been developed over the past few decades. In this lesson we will explore the basics of the Monte Carlo method. Monte Carlo algorithms have been known for most of the history of reinforcement learning. However, they are generally considered inefficient for several reasons that will become apparent as we proceed:
1. Monte Carlo methods rely on large numbers of **random samples** to produce estimates. Thus, Monte Carlo algorithms are inherently computationally intensive. 
2. Monte Carlo reinforcement learning algorithms must **complete an entire episode** before an estimate can be produced. 

### Basics of Monte Carlo Simulation

Monte Carlo sampling was developed in the 1940s. Originally, Monte Carlo methods were used to compute estimates of complex functions which were analytically intractable. The basic idea is to **compute an estimate** of a complex function by **averaging a large number of samples**. 

Monte Carlo methods rely on the [**weak law of large numbers**](https://en.wikipedia.org/wiki/Law_of_large_numbers). The law of large numbers is a theorem that states that statistics of independent samples converge to the population values as more unbiased experiments are performed. We can write this mathematically for the **expected value** or mean as:

$$\text{Let } \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\\
\text{then, by the law of large numbers,}\\
\bar{X} \rightarrow E(X) = \mu\\
\text{as}\\
n \rightarrow \infty$$

Thus, if we sample some process $X$ enough times, we can estimate its expected value from the sample mean. The estimate becomes exact only in the limit of infinitely many samples. 
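
As a minimal sketch of the law of large numbers in action, the snippet below estimates the expected value of a fair six-sided die, whose true mean is 3.5, from increasing numbers of samples. The die, the seed, and the name `mc_mean_estimate` are illustrative assumptions and are not part of the grid world code used later in this lesson.

```python
import numpy as np

def mc_mean_estimate(n_samples, seed=1234):
    """Estimate E(X) for a fair six-sided die by averaging n_samples rolls."""
    rng = np.random.default_rng(seed)
    rolls = rng.integers(low=1, high=7, size=n_samples)  # high is exclusive
    return rolls.mean()

## The estimate should move toward the true mean of 3.5 as n grows
for n in [10, 1000, 100000]:
    print(n, mc_mean_estimate(n))
```

The small-sample estimates can be far from 3.5, which is one reason Monte Carlo methods need many samples.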

### Monte Carlo Reinforcement Learning

But how do we apply Monte Carlo sampling to reinforcement learning? More specifically, how do we apply Monte Carlo sampling to **episodic** reinforcement learning tasks? 

To understand this algorithm, it helps to examine the backup diagram shown below. This diagram shows Monte Carlo sampling of a single episode.    

<img src="img/MC_Backup.JPG" alt="Drawing" style="width:75px; height:400px"/>
<center> **Backup Diagram for Monte Carlo Reinforcement Learning** </center>  

Starting at the top of the diagram the system is in a state, s. An action, a, causes a transition to a new state. The sampling of the episode proceeds until the terminal state, t, is reached. The return for the initial state can only be computed once the Monte Carlo backup **ends at the terminal state**. In other words, **Monte Carlo algorithms do not bootstrap**.  

In reinforcement learning we do not know the model. But, the agent can take a series of actions and find the rewards for these actions. For each episode the agent will accumulate the history of rewards for each action. 

Recall that for a finite, or episodic, Markov reward process we define the **return** for the state transitions starting from the current state. The return at time step $t$ is the sum of the rewards for all future state transitions until the episode terminates at time step $T$, and can be expressed as:

$$G_t = R_{t+1} + R_{t+2} + \ldots + R_{T} = \sum_{k = 0}^{T-t-1} R_{t+k+1}$$ 

Thus, for any episode the Monte Carlo algorithm will sample the return for the states visited. Over a large (actually infinite) number of episodes the Monte Carlo algorithm will sample each action value many times. The sampled return values are then averaged for each state-action pair. This process will converge to the actual action values, which are **unobservable** directly. 
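
The return for every time step of a finished episode can be computed in a single backward pass, since $G_t = R_{t+1} + G_{t+1}$. The sketch below assumes the rewards of one episode are stored in a list where `rewards[t]` holds $R_{t+1}$; the function name `episode_returns` is hypothetical.

```python
def episode_returns(rewards):
    """Return the list [G_0, G_1, ...] for one episode.

    rewards[t] holds R_{t+1}, the reward received on the transition
    out of time step t.
    """
    returns = [0.0] * len(rewards)
    G = 0.0
    ## Work backward from the end of the episode: G_t = R_{t+1} + G_{t+1}
    for t in reversed(range(len(rewards))):
        G = rewards[t] + G
        returns[t] = G
    return returns

## Two ordinary moves followed by a move into the goal
print(episode_returns([-0.1, -0.1, 10.0]))
```

Because the return depends on every later reward, none of these values is available until the episode has terminated.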

When sampling a single episode of a Markov process, the Monte Carlo algorithm may or may not visit a state one or more times. The question then becomes, how should returns be computed if a state is visited more than once in an episode? There are two options, each of which has different statistical convergence properties:
1. **First visit** Monte Carlo only estimates returns from the first visit to a state to termination. We will use first visit Monte Carlo in this lesson.
2. **Every visit** Monte Carlo accumulates the returns for any visit to a state in an episode.
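
The difference between the two options can be seen on a single hypothetical episode in which one state is visited twice. The following is a sketch only: the function `visit_returns` and the episode data are assumptions made for illustration.

```python
from collections import defaultdict

def visit_returns(states, rewards, first_visit=True):
    """Collect the sampled returns per state for one episode.

    states[t] is the state at time t and rewards[t] is the reward
    received on the transition out of states[t].
    """
    ## Returns for every time step, computed backward from the end of the episode
    G = 0.0
    returns = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G += rewards[t]
        returns[t] = G

    samples = defaultdict(list)
    for t, s in enumerate(states):
        if first_visit and s in samples:
            continue  # Only the first visit contributes a sample
        samples[s].append(returns[t])
    return dict(samples)

## State 5 is visited twice in this hypothetical episode
states = [5, 6, 5, 4]
rewards = [-1.0, -1.0, -1.0, 10.0]
print(visit_returns(states, rewards, first_visit=True))
print(visit_returns(states, rewards, first_visit=False))
```

With first visit sampling, state 5 contributes one return per episode. With every visit sampling it contributes two, one for each visit.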


## Example of First Visit Monte Carlo RL

With this short introduction to MC RL in mind, let's try an example. We will sample the action value function using a simple MC algorithm here. 

**Navigation** to a goal is a significant problem in robotics. Real-world navigation is rather complex. Therefore, in this example we will use a simple analog called a **grid world**. The grid world for this problem is shown below. 

<img src="img/GridWorld.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **A 4x4 Grid World with Terminal State** </center>

The grid world consists of a 4x4 set of positions the robot can occupy. Each position is considered a state. The goal is to navigate to state 0, the goal, in the minimum number of steps. We will explore methods to find policies which reach this goal and achieve maximum reward. 

Grid position 0 is the goal and a **terminal state**. There are no possible state transitions out of this position. The presence of a terminal state makes this an **episodic Markov random process**. For each episode sampled the robot can start in any other random position, $\{ 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \}$. This random selection process makes this a **random start** Monte Carlo algorithm. The episode terminates when the robot enters the terminal position (state 0).  

### Representations For MC RL

In each state, there are four possible actions the robot can take:
- up, u
- down, d
- left, l
- right, r

The MC RL agent has no model. Therefore, beyond these allowed actions, all other information is encapsulated in the environment and is unobservable by the agent. This is the key difference between reinforcement learning and dynamic programming. 

In reality, an RL agent may need to explore to find the possible actions when it is in some particular state. To simplify our example, we encode, or represent, these possibilities in a dictionary as shown in the code block below. We use a dictionary of dictionaries to perform the lookup. The keys of the outer dictionary are the identifiers (numbers) of the states. The keys of the inner dictionary are the possible actions and the values are the **successor state**, $s'$, for that transition.  


In [None]:
## import numpy for later
import numpy as np
import numpy.random as nr

## Define the transition dictionary of dictionaries:
neighbors = {0:{'u':0, 'd':0, 'l':0, 'r':0},
          1:{'u':1, 'd':5, 'l':0, 'r':2},
          2:{'u':2, 'd':6, 'l':1, 'r':3},
          3:{'u':3, 'd':7, 'l':2, 'r':3},
          4:{'u':0, 'd':8, 'l':4, 'r':5},
          5:{'u':1, 'd':9, 'l':4, 'r':6},
          6:{'u':2, 'd':10, 'l':5, 'r':7},
          7:{'u':3, 'd':11, 'l':6, 'r':7},
          8:{'u':4, 'd':12, 'l':8, 'r':9},
          9:{'u':5, 'd':13, 'l':8, 'r':10},
          10:{'u':6, 'd':14, 'l':9, 'r':11},
          11:{'u':7, 'd':15, 'l':10, 'r':11},
          12:{'u':8, 'd':12, 'l':12, 'r':13},
          13:{'u':9, 'd':13, 'l':12, 'r':14},
          14:{'u':10, 'd':14, 'l':13, 'r':15},
          15:{'u':11, 'd':15, 'l':14, 'r':15}}
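
The same lookup table can also be generated programmatically rather than typed by hand. The following is a minimal sketch, assuming the states are numbered row by row from the top-left corner as in the figure; the function name `build_neighbors` is hypothetical.

```python
def build_neighbors(n_rows, n_cols, terminal=0):
    """Build the successor-state lookup for a grid numbered row by row from 0."""
    neighbors = {}
    for s in range(n_rows * n_cols):
        row, col = divmod(s, n_cols)
        if s == terminal:
            ## No transitions lead out of the terminal state
            neighbors[s] = {'u': s, 'd': s, 'l': s, 'r': s}
            continue
        neighbors[s] = {
            'u': s - n_cols if row > 0 else s,           # Stay put at the top edge
            'd': s + n_cols if row < n_rows - 1 else s,  # Stay put at the bottom edge
            'l': s - 1 if col > 0 else s,                # Stay put at the left edge
            'r': s + 1 if col < n_cols - 1 else s,       # Stay put at the right edge
        }
    return neighbors

grid = build_neighbors(4, 4)
print(grid[5])   # Interior state
print(grid[3])   # Top-right corner
```

The inner keys are created in the order `u`, `d`, `l`, `r`, matching the hand-written dictionary, which matters because later code pairs actions with probabilities by position.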

When performing MC RL, we can start with an arbitrary initial policy. The MC RL agent will then improve this policy and in the process will **learn the Markov process model**. Again, this is a key difference with dynamic programming where this model is specified. 

We need to define the transition probabilities for the initial policy. We set the probabilities for each transition as a **uniform distribution** leading to random action by the robot. As there are 4 possible transitions from each state, this means all transition probabilities are 0.25. In other words, this is a random policy which does not favor any particular plan. 

The initial uniform transition probabilities are encoded using a dictionary of dictionaries. The organization of this data structure is identical to the foregoing data structure. 

In [None]:
policy = {0:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0},
                        1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}, 
                        2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        6:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        7:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        8:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}
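
A uniform policy like the one above can also be built with a dictionary comprehension. This is a sketch under the assumption that every non-terminal state allows the same four actions; `uniform_policy` and `initial_policy` are hypothetical names.

```python
def uniform_policy(states, actions=('u', 'd', 'l', 'r'), terminal=0):
    """Assign equal probability to every action in every non-terminal state."""
    p = 1.0 / len(actions)
    return {s: {a: (0.0 if s == terminal else p) for a in actions} for s in states}

initial_policy = uniform_policy(range(16))
print(initial_policy[1])
```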


The robot receives the following rewards:
- 10 for entering position 0. 
- -1 for attempting to leave the grid. In other words, we penalize the robot for hitting the edges of the grid.  
- -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges. 

This **reward structure is unknown to the MC RL agent**. The agent must **learn** the rewards by sampling the environment. Here the rewards are in the form of action values. We encode these rewards in the same type of dictionary structure used for the foregoing structures. 

In [None]:
rewards = {0:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0},
          1:{'u':-1, 'd':-0.1, 'l':10.0, 'r':-0.1},
          2:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1.0, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          4:{'u':10.0, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          5:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          6:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          7:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          8:{'u':-0.1, 'd':-0.1, 'l':-1.0, 'r':-0.1},
          9:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          10:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-0.1},
          11:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1.0},
          12:{'u':-0.1, 'd':-1.0, 'l':-1.0, 'r':-0.1},
          13:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          14:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-0.1},
          15:{'u':-0.1, 'd':-1.0, 'l':-0.1, 'r':-1.0}}
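
The three reward rules listed above can be written as one small function of the current state and its successor. The sketch below is a hypothetical restatement of those rules (the name `transition_reward` is an assumption) and is not used by the code that follows.

```python
def transition_reward(s, s_prime, terminal=0):
    """Reward for the transition from state s to successor state s_prime."""
    if s == terminal:
        return 0.0    # No rewards once the episode has ended
    if s_prime == terminal:
        return 10.0   # Entering the goal
    if s_prime == s:
        return -1.0   # The move was blocked by the edge of the grid
    return -0.1       # Ordinary move between positions

print(transition_reward(1, 0), transition_reward(3, 3), transition_reward(5, 6))
```

A blocked move is recognized here by the robot remaining in the same state, which is how the transition dictionary encodes an attempt to leave the grid.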

The MC RL agent will sample state action values in a random manner. The RL agent does not know how many states there are and which states are terminal. Thus, the agent performs a random walk until the terminal state is reached. The probabilities of a particular state transition are determined by the current policy. 

### MC Policy Evaluation

We are using random start MC. The code in the cell below generates a random walk from the starting state to the terminal state. This process is **not part of the agent**, since the agent does not know which state transitions are possible from its current state. You can understand the details of this function by reading the code comments. Execute this code and examine the results. 

In [None]:
def MC_generate_episode(policy, neighbors, terminal):
    '''Generate one random start episode as a list of states, 
    beginning with the start state and ending with the terminal state.'''
    ## List of states which might be visited in episode
    states = list(neighbors.keys())
    
    ## Random starting state for this episode, but can't be the terminal state
    current_state = nr.choice(states, size = 1)[0]
    while(current_state == terminal): # Keep trying to not use terminal state to start
        current_state = nr.choice(states, size = 1)[0]
            
    ## Take a random walk through the states until we get to the terminal state.
    ## A state may appear more than once; first visit bookkeeping is left to the caller.
    visited = [current_state] # States on the random walk, beginning with the start state
    while(current_state != terminal): # Stop when at terminal state
        ## Probability of each action (state transition) given the policy
        probs = list(policy[current_state].values())
        ## Find next state to transition to
        next_state = nr.choice(list(neighbors[current_state].values()), size = 1, p = probs)[0]
        visited.append(next_state)
        current_state = next_state  
    return(visited)    
    
nr.seed(4567)    
MC_generate_episode(policy, neighbors, 0) 

The random walk generated may visit the same state a number of times. Eventually the terminal state, 0, is reached and the walk is over. 

The random walk generated by the above function defines one episode. By generating a large number of episodes we can perform **Monte Carlo policy evaluation**. For each episode the returns for each state visited are accumulated, starting with the **first visit**. Once all episodes have concluded, the average returns are computed. 

Execute this code to evaluate the initial uniformly distributed policy. 

In [None]:
def MC_state_values(policy, neighbors, rewards, terminal, episodes = 1):
    '''Function for first visit Monte Carlo on GridWorld.'''
    ## Create list of states 
    states = list(policy.keys())
    n_states = len(states)
    
    ## An array to hold the accumulated returns as we visit states
    G = np.zeros((episodes,n_states))
    
    ## An array to keep track of how many times we visit each state so we can 
    ## compute the mean
    n_visits = np.zeros((n_states))
    
    ## Iterate over the episodes
    for i in range(episodes):
        ## For each episode we use a list to keep track of states we have visited.
        ## Once we visit a state we need to accumulate values to get the returns
        states_visited = []
   
        ## Get a path for this episode
        visit_list = MC_generate_episode(policy, neighbors, terminal)
        current_state = visit_list[0]
        for state in visit_list[1:]: # Each remaining entry is the successor of current_state
            ## list of states we can transition to from current state
            transition_list = list(neighbors[current_state].values())
            
            if(state in transition_list): # Make sure the transition is allowed
                transition_index = transition_list.index(state)   
  
                ## Find the reward for the state transition
                v_s = list(rewards[current_state].values())[transition_index]
   
                ## On the first visit to the current state in this episode, mark it 
                ## as visited and count the episode toward the mean for that state
                if(current_state not in states_visited): 
                    states_visited.append(current_state)  
                    n_visits[current_state] = n_visits[current_state] + 1.0
                ## Loop over the states already visited to add the reward to their returns
                for visited in states_visited:
                    G[i,visited] = G[i,visited] + v_s
            ## Update the current state for next transition
            current_state = state   
    
    ## Compute the average of G over the episodes and return
    n_visits = [nv if nv != 0.0 else 1.0 for nv in n_visits]
    returns = np.divide(np.sum(G, axis = 0), n_visits)   
    return(returns)              
    
#nr.seed(335)
returns = MC_state_values(policy, neighbors, rewards, terminal = 0, episodes = 1000)
np.array(returns).reshape((4,4))

These results look promising. The returns become smaller the further the state is from the goal. 
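
The function above stores the returns for every episode and averages them at the end. As an alternative, the same sample mean can be maintained incrementally with the running-mean update $V \leftarrow V + (G - V)/N$, which avoids storing per-episode returns. The following is a sketch with hypothetical names and made-up sample values, not the code used in this lesson.

```python
import numpy as np

def update_mean(value, count, sampled_return):
    """Fold one new sampled return into a running mean; returns (new_mean, new_count)."""
    count += 1
    value += (sampled_return - value) / count
    return value, count

## Hypothetical sampled returns for one state over four episodes
samples = [9.8, 7.5, 8.9, 9.9]
V, N = 0.0, 0
for G in samples:
    V, N = update_mean(V, N, G)

print(V, np.mean(samples))  # The two means agree up to floating point error
```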


### Policy Improvement

RL MC uses the idea of **generalized policy iteration** (GPI). Recall that GPI divides the policy improvement and evaluation steps into opposing processes and iterates between them. This process can be done in a quite granular way, even a single episode or a single state at a time. The figure below illustrates the concept of GPI.  

<img src="img/GPI.JPG" alt="Drawing" style="width:250px; height:250px"/>
<center> **Concept of Generalized Policy Iteration** </center>

The code in the cell below uses the GPI method to improve the policy for the grid world. The outer loop, or cycle, performs one iteration of GPI, which has two steps: 
1. At the start of the loop the returns for the current policy are evaluated. In this case we use the average of several MC episodes. 
2. Next, the policy is updated using the new return values. The policy is improved by increasing the transition probabilities for actions (transitions) with higher reward. To ensure that the algorithm **continues to explore**, transition probabilities are never set to 0 but rather to a small minimum value $\epsilon$. 
 
Additional details on the operation of this algorithm can be obtained by reading the comments. Execute this code and examine the results.  

In [None]:
import copy
def MC_optimal_policy(policy, neighbors, rewards, terminal, episodes = 10, cycles = 10, epsilon = 0.05):
    ## Create a working copy of the initial policy
    current_policy = copy.deepcopy(policy)
    
    ## Loop over a number of cycles of GPI
    for _ in range(cycles):
        ## First compute the average returns for each of the states. 
        ## This is the policy evaluation phase
        returns = MC_state_values(current_policy, neighbors, rewards, terminal = terminal, episodes = episodes)
        
        ## We want max Q for each state, where Q is just the difference 
        ## in the values of the possible state transitions
        ## This is the policy improvement phase
        for s in current_policy.keys(): # iterate over all states
            ## Compute Q for each possible state transition
            ## Start by creating a list of the adjacent states.
            possible_s_prime = neighbors[s]
            neighbor_states = list(possible_s_prime.values())
            ## Check if terminal state is neighbor, but state is not terminal.
            if(terminal in neighbor_states and s != terminal):
                ## account for the special case adjacent to goal
                neighbor_Q = []
                for s_prime in possible_s_prime.keys(): # Iterate over adjacent states
                    if(neighbors[s][s_prime] == terminal):  
                         neighbor_Q.append(returns[s])
                    else: neighbor_Q.append(0.0) ## Other transitions have 0 value.   
            else: 
                 ## The other case is rather easy. Compute Q for the transition to each neighbor           
                 neighbor_values = returns[neighbor_states]
                 neighbor_Q = [n_val - returns[s] for n_val in neighbor_values]
                
            ## Find the indices of the state transitions with the largest values 
            ## May be more than one. 
            max_index = np.where(np.array(neighbor_Q) == max(neighbor_Q))[0]  
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transitions actually add to exactly 1.0
            neighbors_len = float(len(np.array(neighbor_Q)))
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder          
                   
    return current_policy
 
nr.seed(9876)    
MC_policy = MC_optimal_policy(policy, neighbors, rewards, terminal = 0, episodes = 100, cycles = 20, 
                              epsilon = 0.01)  
MC_policy

The improved policy makes sense. Transitions that move the robot closer to the goal are favored. 
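
The idea of favoring the best transitions while never setting a probability to 0 can be isolated in a few lines. The sketch below is a simplified variant of the update in the function above, in which each non-greedy action receives exactly $\epsilon$ and the greedy actions share the remaining probability. It differs in detail from the rounding logic used there, and the name `epsilon_soft_probs` is hypothetical.

```python
import numpy as np

def epsilon_soft_probs(q_values, epsilon=0.05):
    """Action probabilities that favor the best actions but keep every action possible."""
    q = np.asarray(q_values, dtype=float)
    greedy = np.isclose(q, q.max())        # True for every action tied for the maximum
    n_greedy = greedy.sum()
    n_other = len(q) - n_greedy
    probs = np.full(len(q), epsilon)       # Non-greedy actions each get epsilon
    ## Greedy actions share whatever probability mass is left
    probs[greedy] = (1.0 - epsilon * n_other) / n_greedy
    return probs

print(epsilon_soft_probs([0.2, 1.5, -0.3, 1.5], epsilon=0.01))
```

When all actions are tied, every action is treated as greedy and the result is the uniform distribution.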

Finally, execute the code in the cell below to compute the returns for the improved policy. 

In [None]:
nr.seed(369)
returns = MC_state_values(MC_policy, neighbors, rewards, terminal = 0, episodes = 1000)
np.array(returns).reshape((4,4))

These returns are significantly higher than for the random policy, indicating the policy is indeed an improvement. 

#### Copyright 2018, Stephen F Elston. All rights reserved.