# Tutorial 5 - Options Intro

Complete this tutorial to get an overview of options and to implement SMDP Q-Learning and Intra-Option Q-Learning.


### References:

[Recent Advances in Hierarchical Reinforcement Learning](https://people.cs.umass.edu/~mahadeva/papers/hrl.pdf) is strongly recommended for the HRL topics that were covered in class. Watch Prof. Ravi's lectures on Moodle or NPTEL to further understand the core concepts. Contact the TAs for further resources if needed.


In [None]:
'''
A bunch of imports, you don't have to worry about these
'''

import numpy as np
import random
import gym
from gym.wrappers import Monitor
import glob
import io
import matplotlib.pyplot as plt
from IPython.display import HTML
import pandas as pd

In [None]:
'''
The environment used here is very similar to the OpenAI Gym ones,
even if it looks slightly different at first glance.
The usual commands we use in our experiments are added to this cell to help you
work with this environment.
'''

#Setting up the environment
from gym.envs.toy_text.cliffwalking import CliffWalkingEnv
env = CliffWalkingEnv()

env.reset()

#Current State
print(env.s)

# 4x12 grid = 48 states
print ("Number of states:", env.nS)

# Primitive actions
action = ["up", "right", "down", "left"]
# These correspond to the integers [0, 1, 2, 3] that are actually passed to the environment

# The agent can go up, right, down or left
print ("Number of actions that an agent can take:", env.nA)

# Example Transitions
rnd_action = random.randint(0, 3)
print ("Action taken:", action[rnd_action])
next_state, reward, is_terminal, t_prob = env.step(rnd_action)
print ("Transition probability:", t_prob)
print ("Next state:", next_state)
print ("Reward received:", reward)
print ("Terminal state:", is_terminal)
env.render()

36
Number of states: 48
Number of actions that an agent can take: 4
Action taken: right
Transition probability: {'prob': 1.0}
Next state: 36
Reward received: -100
Terminal state: False
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  C  C  C  C  C  C  C  C  C  C  T



#### Options
We define two very simple custom options here. They may not be the most logical options for this setting; they were deliberately chosen to make the Q-table easier to visualise.
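
To make the structure of such an option concrete, here is a minimal sketch that does not use the gym environment. It assumes the 4x12 grid with states numbered row by row, models only the "up" move, and uses hypothetical helper names (`row_of`, `away_option`, `step_up`, `rollout`) that are not part of the tutorial code. It treats an option as a function returning (action to take, termination flag) and follows it until the flag is set.

```python
N_COLS = 12  # assumption: 4 rows x 12 columns, states numbered row by row


def row_of(state):
    """Row index (0 = top row) of a flat state index."""
    return state // N_COLS


def away_option(state):
    """Hypothetical 'Away'-style option: always pick 'up' (0), stop in the top row."""
    return 0, row_of(state) == 0


def step_up(state):
    """Toy transition for 'up' only: move one row up, or stay put in the top row."""
    return state - N_COLS if row_of(state) > 0 else state


def rollout(option, state):
    """Follow the option until it terminates; return the list of visited states."""
    visited = [state]
    done = False
    while not done:
        action, _ = option(state)   # action prescribed in the current state
        assert action == 0          # only 'up' is modelled in this toy
        state = step_up(state)
        _, done = option(state)     # termination is checked in the next state
        visited.append(state)
    return visited


print(rollout(away_option, 36))  # start cell: row 3, column 0
```

Note that the option always executes at least one step, even when it is started in a state where its termination condition already holds.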


In [None]:
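# We define two options here
# Option 1 ["Away"]  -> Away from the cliff (i.e. keep going up)
# Option 2 ["Close"] -> Close to the cliff (i.e. keep going down)

def Away(env, state):
    # Returns [action to take, whether the option terminates in `state`]
    optdone = False
    optact = 0  # "up"

    # Terminate once the agent is in the top row (row 0)
    if state // 12 == 0:
        optdone = True

    return [optact, optdone]

def Close(env, state):
    # Returns [action to take, whether the option terminates in `state`]
    optdone = False
    optact = 2  # "down"

    # Terminate in the row just above the cliff (row 2) or in the bottom row (row 3)
    if state // 12 in (2, 3):
        optdone = True

    return [optact, optdone]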
    
    
'''
Now the new action space will contain
Primitive Actions: ["up", "right", "down", "left"]
Options: ["Away","Close"]
Total Actions :["up", "right", "down", "left", "Away", "Close"]
Corresponding to [0,1,2,3,4,5]
'''


In [None]:

seed = 44
rg = np.random.RandomState(seed)
rg.rand()
rg.choice([0,1,2,3,4,5])

4

# Task 1
Complete the code cell below


In [None]:
# Q-table: (states x actions) = (env.nS (48) x total actions (6))
# q_values_SMDP2 and ufd2 are used by the Intra-Option Q-Learning cell in Task 3
q_values_SMDP2 = np.zeros((48,6))

# Update-frequency table: counts how often each (state, action) entry is updated (see Task 4)
ufd2 = np.zeros((48,6))

actions=[0,1,2,3,4,5]
# TODO: epsilon-greedy action selection function
seed = 36
rg = np.random.RandomState(seed)

def egreedy_policy(q_values,state,epsilon):
    if rg.rand() < epsilon:
        return rg.choice([0,1,2,3,4,5])
    else:
        #max = np.max(q_values[state]) 
        #return rg.choice(np.where(q_values[state] == max)[0])
        return np.argmax(q_values[state])

# Task 2
Below is an incomplete code cell that outlines the flow of SMDP Q-Learning. Complete the cell and train the agent using the SMDP Q-Learning algorithm.
Keep the **final Q-table** and the **Update Frequency** table handy (you'll need them in Task 4).
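
As a reminder of the update this task is built around, the sketch below applies a single SMDP Q-Learning update for an option that ran for `k` steps: the entry `Q[s, o]` is moved towards `r_bar + gamma**k * max_a Q[s_next, a]`, where `r_bar` is the discounted sum of the rewards collected while the option ran. The function name `smdp_q_update` and its arguments are illustrative assumptions rather than part of the tutorial code, and the default `alpha` and `gamma` are just example values.

```python
import numpy as np


def smdp_q_update(q, s, o, rewards, s_next, alpha=0.4, gamma=0.9):
    """One SMDP Q-learning update for option `o` started in `s`.

    `rewards` holds the per-step rewards collected while the option ran,
    and `s_next` is the state in which the option terminated.
    """
    k = len(rewards)
    # Discounted return accumulated inside the option: sum_i gamma^i * r_i
    r_bar = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Bootstrap from the greedy value of the termination state, discounted by gamma^k
    target = r_bar + (gamma ** k) * np.max(q[s_next])
    q[s, o] += alpha * (target - q[s, o])
    return r_bar


q = np.zeros((48, 6))
# Hypothetical 3-step option from state 36 to state 0 with reward -1 per step
r_bar = smdp_q_update(q, s=36, o=4, rewards=[-1, -1, -1], s_next=0)
print(r_bar, q[36, 4])
```

The key point is that only the state in which the option was initiated receives an update, and only once, after the option has terminated.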

In [None]:
q_values_SMDP = np.zeros((48,6))
ufd1 = np.zeros((48,6))#Update_Frequency Data structure

In [None]:
#### SMDP Q-Learning 

# Add parameters you might need here
gamma = 0.9
alpha = 0.4

# Iterate over 1000 episodes
for _ in range(1000):
    state = env.reset()    
    done = False

    # While episode is not over
    while not done:
        
        # Choose action        
        action = egreedy_policy(q_values_SMDP, state, epsilon=0.1)
        
        # Checking if primitive action
        if action < 4:
            # Perform regular Q-Learning update for state-action pair

            next_state, reward, done,_ = env.step(action)
            q_values_SMDP[state, action] += alpha*(reward + gamma*np.max([q_values_SMDP[next_state, action] for action in actions]) - q_values_SMDP[state, action])
            ufd1[state,action] += 1
            state = next_state
        
        # Checking if action chosen is an option
        reward_bar = 0
        if action == 4: # action => Away option
            
            initial_state = np.copy(state)
            optdone = False
            count=0
            while (optdone == False):
                
                # Think about what this function might do?
                optact,_ = Away(env,state) 
                #
                next_state, reward, done,_ = env.step(optact)

                
                _,optdone = Away(env,next_state) 
                
                # Is this formulation right? What is this term?
                # Ans: the accumulated discounted return over the entire option
                #if next_state != state:
                reward_bar = reward_bar +  (gamma**count)*reward
                count+=1
                # Complete SMDP Q-Learning Update
                # Remember SMDP Updates. When & What do you update? 
                state = next_state

            q_values_SMDP[initial_state, action] += alpha*(reward_bar + (gamma**count)*np.max([q_values_SMDP[state, action] for action in actions]) - q_values_SMDP[initial_state, action])
            ufd1[initial_state,action] += 1
              
           
        if action == 5: # action => Close option

            initial_state = np.copy(state)
            optdone = False
            count=0
            while (optdone == False):
                
                # Think about what this function might do?
                optact,_ = Close(env,state) 
                #
                next_state, reward, done,_ = env.step(optact)

                
                _,optdone = Close(env,next_state) 
                
                # Is this formulation right? What is this term?
                # Ans: the accumulated discounted return over the entire option
                #if next_state != state:
                reward_bar = reward_bar +  (gamma**count)*reward
                count+=1
                # Complete SMDP Q-Learning Update
                # Remember SMDP Updates. When & What do you update? 
                state = next_state

            q_values_SMDP[initial_state, action] += alpha*(reward_bar + (gamma**count)*np.max([q_values_SMDP[state, action] for action in actions]) - q_values_SMDP[initial_state, action])
            ufd1[initial_state,action] += 1




# Task 3
Using the same options and the SMDP code, implement Intra-Option Q-Learning in the code cell below. You *might not* always have to search through the options to find those with similar policies; think about it. Keep the **final Q-table** and the **Update Frequency** table handy (you'll need them in Task 4).
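
For orientation, here is a minimal sketch of a one-step intra-option update, assuming deterministic termination (the option either certainly continues or certainly stops in the next state). After every primitive step taken inside an option, both the executed primitive action and the option itself are updated. The function name `intra_option_update` and its arguments are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np


def intra_option_update(q, s, a, o, r, s_next, terminates, alpha=0.4, gamma=0.9):
    """Update primitive action `a` and option `o` from one transition (s, a, r, s_next).

    `terminates` says whether option `o` terminates in `s_next`.
    """
    # Ordinary Q-learning update for the primitive action that was executed
    q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
    # Option value: keep following `o` if it continues, otherwise act greedily
    if terminates:
        bootstrap = np.max(q[s_next])
    else:
        bootstrap = q[s_next, o]
    q[s, o] += alpha * (r + gamma * bootstrap - q[s, o])


q = np.zeros((48, 6))
# Hypothetical step: option 4 takes action 0 in state 36, lands in 24, and continues
intra_option_update(q, s=36, a=0, o=4, r=-1, s_next=24, terminates=False)
print(q[36, 0], q[36, 4])
```

Compared with the SMDP update, every intermediate state visited while the option runs receives an update, not just the state where the option was started.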



In [None]:
#### Intra-Option Q-Learning 



# Add parameters you might need here
gamma = 0.9
alpha = 0.4

# Iterate over 1000 episodes
for _ in range(1000):
    state = env.reset()    
    done = False

    # While episode is not over
    while not done:
        
        # Choose action        
        action = egreedy_policy(q_values_SMDP2, state, epsilon=0.1)
        
        # Checking if primitive action
        if action < 4:
            # Perform regular Q-Learning update for state-action pair

            next_state, reward, done,_ = env.step(action)
            q_values_SMDP2[state, action] += alpha*(reward + gamma*np.max([q_values_SMDP2[next_state, action] for action in actions]) - q_values_SMDP2[state, action])
            ufd2[state,action] += 1

            state = next_state
        
        # Checking if action chosen is an option
        reward_bar = 0
        if action == 4: # action => Away option

            #initial_state = state
            optdone = False
            #count=0
            while (optdone == False) :
                
                # Think about what this function might do?
                optact,_ = Away(env,state) 
                #
                next_state, reward, done,_ = env.step(optact)
                _,optdone = Away(env,next_state) 

                # Q-learning update for the primitive action the option just executed
                q_values_SMDP2[state, optact] += alpha*(reward + gamma*np.max([q_values_SMDP2[next_state, action] for action in actions]) - q_values_SMDP2[state, optact])
                ufd2[state,optact] += 1
                
                if not optdone:
                    # Option continues in next_state: bootstrap from the option's own value there
                    q_values_SMDP2[state, action] += alpha*(reward + gamma*q_values_SMDP2[next_state, action] - q_values_SMDP2[state, action])
                    ufd2[state,action] += 1
                else:
                    # Option terminates in next_state: bootstrap from the greedy value there
                    q_values_SMDP2[state, action] += alpha*(reward + gamma*np.max([q_values_SMDP2[next_state, action] for action in actions]) - q_values_SMDP2[state, action])
                    ufd2[state,action] += 1

                # The intra-option updates above are applied after every step of the option,
                # so no separate update is needed once the option terminates
                state = next_state

            

              
           
        if action == 5: # action => Close option

            #initial_state = state
            optdone = False
            #count=0
            while (optdone == False) :
                
                # Think about what this function might do?
                optact,_ = Close(env,state) 
                #
                next_state, reward, done,_ = env.step(optact)
                _,optdone = Close(env,next_state) 

                # Q-learning update for the primitive action the option just executed
                q_values_SMDP2[state, optact] += alpha*(reward + gamma*np.max([q_values_SMDP2[next_state, action] for action in actions]) - q_values_SMDP2[state, optact])
                ufd2[state,optact] += 1
                
                if not optdone:
                    # Option continues in next_state: bootstrap from the option's own value there
                    q_values_SMDP2[state, action] += alpha*(reward + gamma*q_values_SMDP2[next_state, action] - q_values_SMDP2[state, action])
                    ufd2[state,action] += 1
                else:
                    # Option terminates in next_state: bootstrap from the greedy value there
                    q_values_SMDP2[state, action] += alpha*(reward + gamma*np.max([q_values_SMDP2[next_state, action] for action in actions]) - q_values_SMDP2[state, action])
                    ufd2[state,action] += 1

                # The intra-option updates above are applied after every step of the option,
                # so no separate update is needed once the option terminates
                state = next_state




# Task 4
Compare the two Q-Tables and Update Frequencies and provide comments.

In [None]:
def table_render(arr):
  # Print a (states x actions) table with one labelled column per action/option
  print(pd.DataFrame(arr,columns=["up", "right", "down", "left", "Away", "Close"]))

table_render(q_values_SMDP)

          up       right        down      left      Away       Close
0  -7.723350   -7.673279   -7.673700 -7.720924 -7.719956   -7.684267
1  -7.492889   -7.429193   -7.429922 -7.656083 -7.547030   -7.436956
2  -7.185370   -7.160253   -7.160402 -7.227078 -7.251876   -7.165101
3  -7.056034   -6.849924   -6.850342 -7.049109 -6.852688   -6.853364
4  -6.518621   -6.504301   -6.507254 -6.829365 -6.658785   -6.504529
5  -6.280359   -6.118122   -6.120294 -6.409919 -6.235497   -6.118377
6  -5.869688   -5.690003   -5.690244 -6.276430 -5.861083   -5.689720
7  -5.349023   -5.213921   -5.215011 -5.516740 -5.343222   -5.214205
8  -4.829440   -4.684439   -4.684463 -5.075277 -4.950524   -4.684143
9  -4.196216   -4.094398   -4.094810 -4.466637 -4.260241   -4.094529
10 -3.778425   -3.438831   -3.438774 -4.451016 -3.731222   -3.438709
11 -3.149875   -3.155434   -2.709993 -3.102907 -2.786104   -2.709992
12 -7.666658   -7.437925   -7.438822 -7.503608 -7.445794   -7.443662
13 -7.469991   -7.169796   -7.1698

In [None]:
table_render(q_values_SMDP2)

          up       right        down      left      Away       Close
0  -7.791814   -7.709004   -7.710273 -7.812283 -7.791814   -7.708910
1  -7.568938   -7.457301   -7.457524 -7.485172 -7.568745   -7.457161
2  -7.399192   -7.175212   -7.175475 -7.339622 -7.399138   -7.175404
3  -6.977592   -6.861692   -6.861799 -7.023587 -6.974668   -6.861738
4  -6.545810   -6.513088   -6.513191 -6.874520 -6.515168   -6.513128
5  -6.393669   -6.125717   -6.125740 -6.485679 -6.386120   -6.125727
6  -6.034377   -5.695277   -5.695322 -6.291167 -6.033929   -5.695305
7  -5.385022   -5.217016   -5.217021 -5.503846 -5.383960   -5.217017
8  -5.064566   -4.685585   -4.685587 -4.750355 -4.963330   -4.685585
9  -4.481449   -4.095099   -4.095100 -4.946974 -4.344957   -4.095098
10 -3.991223   -3.439000   -3.439000 -4.122576 -3.617607   -3.439000
11 -2.828941   -2.786104   -2.710000 -3.301609 -2.786104   -2.710000
12 -7.934448   -7.457918   -7.458099 -7.532526 -7.934448   -7.458062
13 -7.709915   -7.175696   -7.1757

Note that both methods have converged to similar Q-values.
The Q-values are very low, close to -106, for action 'down' and option 'Close' in states 25-34. These states lie in the row directly above the cliff, so stepping down from them falls into the cliff, and the agent has learnt to avoid those actions.
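
As a rough sanity check of that figure, the sketch below recomputes it by hand. It assumes a cliff penalty of -100 that sends the agent back to the start state, a reward of -1 per ordinary step, `gamma = 0.9`, and that the value of the start state has converged to the return of the shortest safe path (13 steps: up, 11 times right, down). The variable names are illustrative.

```python
gamma = 0.9
steps_from_start = 13  # up, 11 x right, down along the shortest safe path

# Discounted return of collecting -1 on each of the 13 steps to the goal
v_start = -sum(gamma ** i for i in range(steps_from_start))

# Stepping into the cliff: -100 now, then continue optimally from the start state
q_cliff = -100 + gamma * v_start

print(round(v_start, 3), round(q_cliff, 3))
```

Under these assumptions the result is roughly -106.7, which is in line with the values observed in the Q-tables.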

In [None]:
table_render(ufd1)

        up   right   down  left  Away  Close
0     37.0    82.0   58.0  37.0  37.0   22.0
1     34.0    86.0   50.0  26.0  35.0   21.0
2     32.0    91.0   53.0  23.0  32.0   22.0
3     32.0    90.0   50.0  21.0  29.0   21.0
4     27.0    90.0   48.0  20.0  28.0   20.0
5     25.0    89.0   44.0  17.0  25.0   20.0
6     23.0    81.0   42.0  17.0  22.0   19.0
7     19.0    72.0   39.0  14.0  19.0   19.0
8     17.0    58.0   37.0  12.0  17.0   19.0
9     14.0    49.0   37.0  10.0  14.0   20.0
10    12.0    34.0   32.0  11.0  12.0   20.0
11    10.0    10.0   34.0   7.0   8.0   26.0
12    28.0    70.0   26.0  34.0  26.0   27.0
13    26.0    85.0   29.0  23.0  25.0   28.0
14    23.0    92.0   30.0  20.0  23.0   30.0
15    21.0    87.0   30.0  19.0  21.0   29.0
16    19.0    87.0   29.0  18.0  20.0   29.0
17    18.0    85.0   28.0  16.0  16.0   28.0
18    16.0    79.0   29.0  14.0  14.0   27.0
19    14.0    74.0   28.0  12.0  13.0   27.0
20    13.0    67.0   26.0  10.0  11.0   27.0
21    11.0

In [None]:
table_render(ufd2)

        up   right    down  left  Away  Close
0     39.0    68.0    52.0  38.0  37.0   34.0
1     40.0    84.0    48.0  27.0  35.0   36.0
2     38.0    88.0    49.0  23.0  35.0   38.0
3     34.0    96.0    49.0  21.0  30.0   39.0
4     28.0    93.0    51.0  20.0  26.0   38.0
5     29.0    93.0    45.0  18.0  26.0   37.0
6     28.0    81.0    46.0  18.0  24.0   37.0
7     21.0    79.0    42.0  14.0  19.0   35.0
8     24.0    74.0    43.0  12.0  17.0   36.0
9     18.0    60.0    43.0  13.0  14.0   36.0
10    15.0    41.0    42.0  10.0  11.0   37.0
11    10.0     8.0    61.0   8.0   8.0   56.0
12    93.0    70.0    51.0  35.0  92.0   45.0
13    68.0    82.0    55.0  23.0  65.0   51.0
14    55.0    97.0    59.0  22.0  53.0   56.0
15    58.0   102.0    60.0  18.0  56.0   57.0
16    50.0   102.0    61.0  18.0  46.0   59.0
17    45.0    97.0    61.0  17.0  41.0   58.0
18    38.0    97.0    62.0  13.0  37.0   59.0
19    42.0    88.0    59.0  13.0  38.0   57.0
20    37.0    78.0    62.0  11.0  

In [None]:
np.sum(ufd1),np.sum(ufd2)

(21171.0, 23664.0)

In [None]:
print(["up", "right", "down", "left", "Away", "Close"])
print(np.sum(ufd1,axis=0))
print(np.sum(ufd2,axis=0))

['up', 'right', 'down', 'left', 'Away', 'Close']
[ 2515. 13840.  2097.   874.   848.   997.]
[ 3149. 13826.  2905.   863.  1370.  1551.]


Note that the number of updates for Intra-Option Q-Learning is greater than for SMDP Q-Learning.
In particular, observe the update frequencies of actions 'up' and 'down' and options 'Away' and 'Close': they are much higher for Intra-Option Q-Learning, as expected. This is because we update the actions 'up' and 'down' even while executing the options, and the options themselves are updated at each intermediate step.
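
To put numbers on this, the sketch below reuses the per-column update totals printed above and computes, for each action and option, the ratio of Intra-Option updates to SMDP updates. The variable names are illustrative.

```python
import numpy as np

labels = ["up", "right", "down", "left", "Away", "Close"]
# Column sums of ufd1 (SMDP) and ufd2 (Intra-Option), copied from the output above
smdp_counts = np.array([2515, 13840, 2097, 874, 848, 997])
intra_counts = np.array([3149, 13826, 2905, 863, 1370, 1551])

ratios = intra_counts / smdp_counts
for name, ratio in zip(labels, ratios):
    print(f"{name:>5}: {ratio:.2f}")
```

The ratios for 'right' and 'left' stay close to 1, since neither option ever executes those actions, while 'up', 'down', 'Away' and 'Close' all receive noticeably more updates under Intra-Option Q-Learning.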