In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np

from simple_grid import simple_grid as gridworld
from simple_grid_agent import GridworldAgent as Agent

Read through all the classes and functions defined inside `simple_grid` environment and `GridworldAgent` to familiarize yourself with the details of this assignment.

Consider a simple gridworld in which actions do not produce deterministic state changes. We specify a $20\%$ probability that the selected action results in a stochastic state transition.
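
One common way to model this kind of stochasticity is to replace the chosen action with a uniformly random one with probability `wind_p`. The sketch below illustrates that idea only: `sample_effective_action` is a hypothetical helper, and the actual transition rule inside `simple_grid` may differ.

```python
import numpy as np

def sample_effective_action(action, n_actions, wind_p, rng):
    """Return the action actually executed under a hypothetical wind model.

    Assumption: with probability wind_p the chosen action is replaced by a
    uniformly random action (which may coincide with the chosen one).
    """
    if rng.random() < wind_p:
        return int(rng.integers(n_actions))
    return action

rng = np.random.default_rng(0)
executed = [sample_effective_action(3, 4, 0.2, rng) for _ in range(10_000)]

# Fraction of steps on which the intended action (3) was executed. Under this
# model it should be close to 1 - wind_p + wind_p / n_actions = 0.85.
frac_intended = np.mean(np.array(executed) == 3)
print(round(float(frac_intended), 2))
```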

In [2]:
#stochastic environment
env = gridworld(wind_p=0.2)

The following commands will help you familiarize yourself with the different components of the gridworld.

In [3]:
print('\n Reward For each Tile \n')
env.print_reward()


 Reward For each Tile 


----------
0 |0 |0 |
----------
0 |-5 |5 |
----------
0 |0 |0 |

Check out the set of possible actions for the grid

In [4]:
print('\n Set of possible actions in numerical form. These are actual inputs to the gridworld agent \n')
print(env.action_space)

print('\n Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction \n')
print(env.action_text)


 Set of possible actions in numerical form. These are actual inputs to the gridworld agent 

[0 1 2 3]

 Set of possible actions in the grid in text form. They map 1 to 1 from numbers above to direction 

['U' 'L' 'D' 'R']


Consider a policy that tries to reach the goal state (+5) as fast as possible. Below we define this policy so that its state values can be evaluated.

In [5]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n Policy: Fastest Path to Goal State(Does not take reward into consideration) \n')
a.print_policy()


 Policy: Fastest Path to Goal State(Does not take reward into consideration) 


----------
R |R |D |
----------
R |R |U |
----------
R |U |U |
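
The printout above can be reproduced from the policy dictionary and the action letters. The following is a minimal sketch, assuming state keys are `(row, column)` pairs; `render_policy` is a hypothetical helper and not the agent's own `print_policy`.

```python
import numpy as np

# Action letters as printed by env.action_text above.
action_text = np.array(['U', 'L', 'D', 'R'])

policy_fast = {(0, 0): 3, (0, 1): 3, (0, 2): 2,
               (1, 0): 3, (1, 1): 3, (1, 2): 0,
               (2, 0): 3, (2, 1): 0, (2, 2): 0}

def render_policy(policy, action_text, n_rows=3, n_cols=3):
    """Return one string per grid row, assuming (row, col) state keys."""
    rows = []
    for r in range(n_rows):
        letters = [str(action_text[policy[(r, c)]]) for c in range(n_cols)]
        rows.append(' '.join(letters))
    return rows

for line in render_policy(policy_fast, action_text):
    print(line)
```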

**Q1**

Implement the `get_v` and `get_q` methods to estimate the state value and state-action value in `simple_grid_agent.py`. These may be used later on for debugging your code
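
For reference, the quantities these two methods estimate are usually defined as follows, where $G_t$ is the discounted return from time $t$, $\gamma$ is the discount factor (`gamma = 0.9` in this notebook), and $\pi$ is the policy being followed:

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad
v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right], \qquad
q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]
$$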

**Q2** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-value estimation equations inside `mc_predict_v` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.
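
As a rough illustration of the difference between the two variants, here is a standalone sketch that averages sampled returns per state. The episode format (a list of `(state, reward)` pairs) and the name `mc_state_values` are assumptions made for this example; they are not the interface of `run_episode` or `mc_predict_v`.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9, first_visit=True):
    """Average sampled returns per state.

    episodes: list of episodes, each a list of (state, reward) pairs, where
    reward is the reward received after leaving that state (assumed format).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Walk backwards so that G accumulates the discounted return.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # back to chronological order
        seen = set()
        for state, G in returns:
            if first_visit and state in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(state)
            returns_sum[state] += G
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# One episode that visits state 'A' twice before terminating.
episode = [('A', 0.0), ('B', 0.0), ('A', 0.0), ('B', 5.0)]
v_first = mc_state_values([episode], gamma=0.9, first_visit=True)
v_every = mc_state_values([episode], gamma=0.9, first_visit=False)
print(v_first['A'], v_every['A'])
```

In this toy episode the first visit to `'A'` has return $0.9^3 \cdot 5 = 3.645$ and the second has $0.9 \cdot 5 = 4.5$, so the every-visit estimate is their mean while the first-visit estimate uses only the former.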

Test and report the results inside this notebook using the following commands. Are there significant differences between the state values under any-visit and first-visit MC prediction? Why?

NB: treat any-visit and every-visit as interchangeable terms.

#### From the state values printed by the two approaches, we can tell that:
##### 1. The two approaches give different values: the first-visit estimate is higher than the every-visit estimate in every non-terminal state, by roughly 0.6 to 1.6. This may be because a state can be visited several times within an episode and the return observed from each visit differs. Every-visit MC averages over all of these returns, while first-visit MC uses only the return from the first visit.
##### 2. The trend across states is similar under both approaches, which means both give the agent similar guidance about which states to move toward.

In [6]:
# evaluate state values for policy_fast with both first-visit and any-visit MC
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)
print('\n State Values for first_visit MC state estiamtion \n')
a.mc_predict_v()
a.print_v()

a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)
print('\n State Values for any_visit MC state estiamtion \n')
a.mc_predict_v(first_visit=False)
a.print_v()


 State Values for first_visit MC state estiamtion 


---------------
0.2 |1.8 |3.5 |
---------------
-2.4 |3.6 |0 |
---------------
-2.7 |-2.1 |3.0 |
 State Values for any_visit MC state estiamtion 


---------------
-1.0 |0.7 |2.7 |
---------------
-3.6 |2.0 |0 |
---------------
-4.0 |-3.4 |2.4 |

**Q3** 

The Monte Carlo rollout itself has been implemented in `simple_grid_agent.py` inside the `run_episode` method.

**Implement** 

First-visit as well as any-visit Monte Carlo state-action value estimation equations inside `mc_predict_q` in `simple_grid_agent.py`.
These have been discussed in class. Refer to Sutton and Barto Chapter 5 for further details to implement them.
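
The same idea carries over to state-action values by keying the visits on `(state, action)` pairs. Below is a minimal sketch, assuming episodes are lists of `(state, action, reward)` triples and Q is stored as a dictionary from state to an array with one entry per action (the layout suggested by the printed `a.q`). The function name and episode format are hypothetical.

```python
import numpy as np

def mc_action_values(episodes, n_actions=4, gamma=0.9, first_visit=True):
    """Estimate Q as a dict mapping state -> array of per-action values.

    episodes: list of episodes, each a list of (state, action, reward)
    triples (an assumed format for this sketch).
    """
    q = {}
    counts = {}
    for episode in episodes:
        # Accumulate discounted returns from the end of the episode.
        G = 0.0
        visits = []
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            visits.append((state, action, G))
        visits.reverse()
        seen = set()
        for state, action, G in visits:
            if first_visit and (state, action) in seen:
                continue
            seen.add((state, action))
            q.setdefault(state, np.zeros(n_actions))
            counts.setdefault(state, np.zeros(n_actions, dtype=int))
            counts[state][action] += 1
            # Incremental mean: Q <- Q + (G - Q) / N
            q[state][action] += (G - q[state][action]) / counts[state][action]
    return q

# A one-step episode: action 3 ('R') taken in state (1, 1), then a reward of 5.
episode = [((1, 1), 3, 5.0)]
q = mc_action_values([episode])
print(q[(1, 1)])
```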

Test and report the results inside this notebook using the following commands. Are there significant differences between the state-action values under any-visit and first-visit MC Q-value prediction? Why?

#### From the output of the two approaches, we can tell that:
##### 1. The (state, action) values differ between the two approaches. Because this is a sequential task, the same (state, action) pair can be visited several times, either within one generated trajectory or across different ones. The position of a visit within the trajectory and the number of visits both affect the sampled returns, so the two estimates differ.
##### 2. However, both approaches give similar answers. This means that either first-visit or every-visit estimation can guide the agent toward optimal moves.

In [7]:
# evaluate state action values for policy_fast
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n State action Values for first_visit MC state action estiamtion \n')
a.mc_predict_q()
# a.print_policy()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])
    

# evaluate state action values for policy_fast
a = Agent(env, policy = policy_fast, gamma = 0.9, 
            start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

print('\n State action Values for any_visit MC state action estiamtion \n')
a.mc_predict_q(first_visit=False)
# a.print_policy()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


 State action Values for first_visit MC state action estiamtion 


 Actions ['U' 'L' 'D' 'R'] 

(2, 1) [-2.73734084 -3.9595396  -3.42849716  1.38691585]
(2, 0) [-3.66056765 -3.91003537 -3.88193042 -3.02787543]
(2, 2) [ 3.45119264 -3.29143713  1.43453296  1.23892577]
(1, 1) [-0.27670621 -3.43372339 -2.97218811  3.74845041]
(0, 1) [-0.06783709 -1.58338738 -3.51127176  1.92980106]
(0, 0) [-1.22502463 -1.46422928 -3.97160695  0.017357  ]
(1, 0) [-1.50851959 -3.70565529 -4.11510011 -2.94032953]
(0, 2) [1.69589533 0.10174607 3.44474671 1.85637429]
(1, 2) [0. 0. 0. 0.]

 State action Values for any_visit MC state action estiamtion 


 Actions ['U' 'L' 'D' 'R'] 

(2, 2) [ 3.28081506 -3.53646088  1.3112994   1.24599079]
(2, 1) [-3.65257051 -4.55715791 -4.1026761   1.00795125]
(2, 0) [-4.21606652 -4.64931366 -4.61675245 -3.83922418]
(1, 1) [-0.44290205 -4.35385651 -4.0343107   3.30138878]
(1, 0) [-2.28536659 -4.61978074 -4.6861435  -3.66622055]
(0, 0) [-1.53592134 -2.27869133 -4.27083841 -0.646

**Q4**

Now we implement Monte Carlo control using state-action values. 

**Implement**

Complete the snippet in `mc_control_q` inside `simple_grid_agent.py`
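
Monte Carlo control alternates between evaluating Q under an exploring policy and making the policy greedy with respect to Q. The sketch below shows those building blocks in isolation. The helper names are hypothetical, and the multiplicative decay with a floor is only an assumed reading of the `start_epsilon`, `end_epsilon` and `epsilon_decay` arguments; the agent class may schedule epsilon differently.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def greedy_policy(q):
    """Policy improvement: map each state to its highest-valued action."""
    return {state: int(np.argmax(values)) for state, values in q.items()}

def decay_epsilon(epsilon, epsilon_decay=0.9, end_epsilon=0.3):
    """Assumed schedule: multiplicative decay, floored at end_epsilon."""
    return max(end_epsilon, epsilon * epsilon_decay)

# Q-values for two states, rounded from the first-visit output in Q3.
q = {(1, 1): np.array([-0.28, -3.43, -2.97, 3.75]),
     (1, 0): np.array([-1.51, -3.71, -4.12, -2.94])}
print(greedy_policy(q))

rng = np.random.default_rng(0)
print(epsilon_greedy_action(q[(1, 1)], epsilon=0.0, rng=rng))

epsilon = 0.9
for _ in range(20):
    epsilon = decay_epsilon(epsilon)
print(epsilon)
```

With these rounded values the greedy action is index 3 (`R`) in state (1, 1) and index 0 (`U`) in state (1, 0).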

Test and report the results inside this notebook using the following commands.

#### Compared with the initial policy, the action for state (1, 0) changed from R to U. This makes sense because moving right leads to the tile with the largest penalty (-5), and only moving up (or down) avoids it. The action for state (2, 1) also changed, from U to R, which likewise avoids the -5 tile. By following the suggested actions, the agent can easily reach the "best" state. However, because the environment is stochastic, the agent will still take some undesired actions. Even so, it will probably still be able to find the optimal action using the current policy.

In [8]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

# Run MC Control
a.mc_control_q(n_episode = 1000,first_visit=False)
a.print_policy()

print(f'\n Actions: {env.action_text} \n')
for i in a.q: print(i,a.q[i])


----------
R |R |D |
----------
U |R |U |
----------
R |R |U |
 Actions: ['U' 'L' 'D' 'R'] 

(1, 1) [-1.53514061 -4.14393046 -3.845456    3.36764789]
(0, 1) [-0.5897438  -2.685826   -4.91362876  1.37192667]
(0, 2) [ 0.98782491 -1.6768405   3.31530592  0.95977406]
(1, 0) [-0.88901848 -4.67981841 -4.53371289 -3.95512805]
(2, 0) [-4.1761788  -4.51790078 -4.47265472 -3.77435347]
(2, 1) [-3.68603312 -3.75119676 -4.07400469  1.12058132]
(0, 0) [-3.97345906 -1.27387    -5.7291139  -0.44024942]
(2, 2) [ 3.56668064 -4.21835055  0.649178    1.10784838]
(1, 2) [0. 0. 0. 0.]


**Q5**

Bonus!

**Implement**

Greedy in the Limit with Infinite Exploration (GLIE) MC control in the `mc_control_glie` function inside `simple_grid_agent.py`.
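
A GLIE exploration schedule keeps exploring every action while letting the exploration rate shrink to zero, so the policy becomes greedy in the limit. A commonly used choice is $\epsilon_k = 1/k$ for episode $k$. The sketch below shows only that schedule; `glie_epsilon` is a hypothetical name, and `mc_control_glie` may use a different schedule.

```python
def glie_epsilon(episode_index):
    """Return epsilon_k = 1 / k for episode k = 1, 2, 3, ..."""
    return 1.0 / episode_index

# Exploration rate at a few episode indices.
schedule = [glie_epsilon(k) for k in (1, 2, 10, 1000)]
print(schedule)
```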

Test and report the results inside this notebook using the following commands.

In [12]:
#stochastic environment
env = gridworld(wind_p=0.2)

#initial policy
policy_fast = {(0, 0): 3,
          (0, 1): 3,
          (0, 2): 2,
          (1, 0): 3,
          (1, 1): 3,
          (1, 2): 0,
          (2, 0): 3,
          (2, 1): 0,
          (2, 2): 0}

#stochastic agent - epsilon greedy with decays
a = Agent(env, policy = policy_fast, gamma = 0.9, 
        start_epsilon=0.9,end_epsilon=0.3,epsilon_decay=0.9)

a.mc_control_glie(n_episode = 1000)
a.print_policy()
print('\n Actions', env.action_text, '\n')
for i in a.q: print(i,a.q[i])


----------
R |U |L |
----------
R |R |U |
----------
U |R |D |
 Actions ['U' 'L' 'D' 'R'] 

(1, 1) [-0.01703777 -0.05953831 -0.0296877   0.00555554]
(2, 0) [ 0.01084514 -0.03243344 -0.01962902 -0.00332621]
(2, 1) [-0.00310368 -0.11587576 -0.0558164   0.01347614]
(2, 2) [ 0.0240962  -0.12748024  0.17665254  0.13684205]
(1, 0) [-0.18295813 -0.0545739  -0.01946614  0.00767557]
(0, 1) [ 0.12241883 -0.23462894 -0.1849119   0.02125264]
(0, 0) [-0.12097907 -0.521952   -0.92226336  0.00143244]
(0, 2) [-0.15077283  0.02931431  0.02530968 -0.0571634 ]
(1, 2) [0. 0. 0. 0.]


#### References:
##### https://ai.stackexchange.com/questions/6486/why-is-glie-monte-carlo-control-an-on-policy-control
##### https://www.jeremyjordan.me/rl-learning-implementations/

#### Notes about my solution:
##### 1. The Q values use a new update rule: Q(S, A) <- Q(S, A) + learning_factor * (G - Q(S, A)).
##### 2. The learning factor is lr when lr is non-zero; otherwise it is 1/N, where N is the number of visits to the current state S.
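
The update rule in these notes can be sketched as a small standalone function. `update_q` and its argument names are hypothetical, and the visit count is passed in by the caller rather than tracked by the agent.

```python
def update_q(q_value, G, n_visits, lr=0.0):
    """Apply Q <- Q + learning_factor * (G - Q).

    learning_factor is lr when lr is non-zero, otherwise 1 / n_visits.
    """
    learning_factor = lr if lr != 0 else 1.0 / n_visits
    return q_value + learning_factor * (G - q_value)

# With 1/N the estimate is the running mean of the returns seen so far.
q = 0.0
for n, G in enumerate([4.0, 2.0, 3.0], start=1):
    q = update_q(q, G, n_visits=n)
print(q)

# With a constant lr the estimate moves a fixed fraction toward each return.
q_const = update_q(0.0, 10.0, n_visits=1, lr=0.1)
print(q_const)
```

The 1/N form weights all returns equally, while a constant `lr` weights recent returns more heavily, which can help when the policy (and therefore the return distribution) keeps changing during control.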