## On-policy learning and SARSA

This notebook builds on `qlearning.ipynb` to implement Expected Value SARSA.

We will use an epsilon-greedy policy: the agent takes the greedy action (the one with the highest current Q-value) with probability $(1-\epsilon)$ and otherwise samples an action uniformly at random. Note that the agent __can__ still pick the greedy action during random sampling by pure chance.
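
As a minimal sketch of the sampling rule described above (the helper name `sample_epsilon_greedy` and the toy Q-values are illustrative assumptions, not part of `qlearning.ipynb`), the greedy action ends up with total probability $(1-\epsilon) + \epsilon/n$ because the random branch can also draw it:

```python
import numpy as np

def sample_epsilon_greedy(q_values, epsilon, rng):
    """Sample an action index under an epsilon-greedy policy (hypothetical helper)."""
    q_values = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # Explore: uniform over ALL actions, so the greedy action can still be drawn
        return int(rng.integers(len(q_values)))
    # Exploit: take the action with the highest Q-value
    return int(q_values.argmax())

rng = np.random.default_rng(0)
q = [1.0, 3.0, 2.0]  # toy Q-values; action 1 is greedy
samples = [sample_epsilon_greedy(q, epsilon=0.2, rng=rng) for _ in range(10_000)]
greedy_share = np.mean(np.array(samples) == 1)
# Theoretical share of the greedy action: (1 - 0.2) + 0.2 / 3, about 0.867
print(greedy_share)
```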

In [None]:
#XVFB will be launched if you run on a server
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    %env DISPLAY=:1
        
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
from qlearning import QLearningAgent

class EVSarsaAgent(QLearningAgent):
    """ 
    An agent that overrides some Q-learning methods to implement Expected Value SARSA.
    
    Notes
    ------
    This implementation assumes that QLearningAgent.update uses get_value(next_state).
    """
    
    def get_value(self, state):
        """ 
        Returns Vpi for current state under epsilon-greedy policy:
          V_{pi}(s) = sum_{over a_i} {pi(a_i | s) * Q(s, a_i)}
        
        Parameters
        ----------
        state : object
            The state to get the value function value from
            Must be a valid key in Q-value
        """
        epsilon = self.epsilon
        possible_actions = self.get_legal_actions(state)
        len_possible_actions = len(possible_actions)
        
        #If there are no legal actions, return 0.0
        if len_possible_actions == 0:
            return 0.0

        q_values_state = np.empty(len(possible_actions))
        for i, action in enumerate(possible_actions):
            q_values_state[i] =  self.get_qvalue(state, action)
        
        # Create the policy
        # We start by giving every action the exploration probability epsilon / n
        pi_state = epsilon*np.ones(len_possible_actions)/len_possible_actions
        # We pick the index with highest Q-value...
        best_action_index = q_values_state.argmax()
        # ...and add 1 - epsilon to that action
        pi_state[best_action_index] += (1-epsilon)
        # NOTE: In order to convince ourselves that the total probability is 1, note that
        #       n*(epsilon/n) + (1-epsilon) = 1 
        
        # Calculate the state value
        state_value = (pi_state*q_values_state).sum()
        
        return state_value
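
To sanity-check the formula outside the agent class, here is a small standalone sketch. The function name `expected_value_eps_greedy` and the toy Q-values are assumptions made for illustration; it mirrors the computation in `get_value` without depending on `QLearningAgent`.

```python
import numpy as np

def expected_value_eps_greedy(q_values, epsilon):
    """V(s) = sum_a pi(a|s) * Q(s, a) for an epsilon-greedy pi (hypothetical helper)."""
    q_values = np.asarray(q_values, dtype=float)
    n = len(q_values)
    # Every action gets the exploration share epsilon / n ...
    pi = np.full(n, epsilon / n)
    # ... and the greedy action additionally gets 1 - epsilon
    pi[q_values.argmax()] += 1.0 - epsilon
    return float(pi @ q_values)

v = expected_value_eps_greedy([1.0, 2.0, 3.0], epsilon=0.2)
# By hand: (0.2 / 3) * (1 + 2 + 3) + 0.8 * 3 = 0.4 + 2.4 = 2.8
print(v)
```

With `epsilon=0` the result reduces to the maximum Q-value (the Q-learning value), and with `epsilon=1` it reduces to the plain mean of the Q-values.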

### Cliff World

Let's now see how our algorithm compares against Q-learning when we force the agent to keep exploring, i.e. when epsilon stays constant throughout training.
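
For reference, and assuming `QLearningAgent.update` follows the usual tabular form (the base class lives in `qlearning.ipynb` and is not shown here), the two agents share the same update rule and differ only in the value assigned to the next state:

$$ Q(s,a) \leftarrow (1-\alpha) \, Q(s,a) + \alpha \left( r + \gamma V(s') \right) $$

$$ V_{\text{Q-learning}}(s') = \max_{a'} Q(s',a'), \qquad V_{\text{EV-SARSA}}(s') = \sum_{a'} \pi(a'|s') \, Q(s',a') $$

Under an epsilon-greedy $\pi$, the expectation includes the occasional random step into the cliff, so Expected Value SARSA should assign lower values to states next to the cliff than Q-learning does.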

<img src=https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/cliffworld.png width=600>
<center><i>image by cs188</i></center>

In [None]:
import gym, gym.envs.toy_text
env = gym.envs.toy_text.CliffWalkingEnv()
n_actions = env.action_space.n

print(env.__doc__)

In [None]:
# Our cliffworld has one difference from what's on the image: there is no wall. 
# Agent can choose to go as close to the cliff as it wishes. x:start, T:exit, C:cliff, o: flat ground
env.render()

In [None]:
def play_and_train(env, 
                   agent,
                   t_max=10**4):
    """
    Runs one full episode with actions chosen by the agent's policy,
    updating the agent after every step.
    
    Parameters
    ----------
    env : gym-object
        The environment to play with
    agent : QLearningAgent
        The agent to play and train with
    t_max : int
        The maximum number of steps to take
    
    Returns
    -------
    total_reward : float
        The accumulated reward
    """
    
    total_reward = 0.0
    s = env.reset()
    
    for t in range(t_max):
        a = agent.get_action(s)
        
        next_s,r,done,_ = env.step(a)
        agent.update(s, a, r, next_s)
        
        s = next_s
        total_reward +=r
        if done:
            break
        
    return total_reward

In [None]:
from qlearning import QLearningAgent

agent_sarsa = EVSarsaAgent(alpha=0.25, epsilon=0.2, discount=0.99,
                       get_legal_actions = lambda s: range(n_actions))

agent_ql = QLearningAgent(alpha=0.25, epsilon=0.2, discount=0.99,
                       get_legal_actions = lambda s: range(n_actions))

In [None]:
from IPython.display import clear_output
from pandas import DataFrame
moving_average = lambda x, span=100: DataFrame({'x':np.asarray(x)}).x.ewm(span=span).mean().values

rewards_sarsa, rewards_ql = [], []

for i in range(5000):
    rewards_sarsa.append(play_and_train(env, agent_sarsa))
    rewards_ql.append(play_and_train(env, agent_ql))
    #Note: agent.epsilon stays constant
    
    if i %100 ==0:
        clear_output(True)
        print('EVSARSA mean reward =', np.mean(rewards_sarsa[-100:]))
        print('QLEARNING mean reward =', np.mean(rewards_ql[-100:]))
        plt.title("epsilon = %s" % agent_ql.epsilon)
        plt.plot(moving_average(rewards_sarsa), label='ev_sarsa')
        plt.plot(moving_average(rewards_ql), label='qlearning')
        plt.grid()
        plt.legend()
        plt.ylim(-500, 0)
        plt.show()

Let's now see what the algorithms have learned by visualizing their best actions in every state.

In [None]:
def draw_policy(env, agent):
    """
    Prints CliffWalkingEnv policy with arrows. 
    Note that this is done in a hard-coded fashion.
    
    Parameters
    ----------
    env : gym-object
        The environment to play with
    agent : QLearningAgent
        The agent to play and train with
    """
    
    n_rows, n_cols = env._cliff.shape
    actions = '^>v<'
    
    for yi in range(n_rows):
        for xi in range(n_cols):
            if env._cliff[yi, xi]:
                print(" C ", end='')
            elif (yi * n_cols + xi) == env.start_state_index:
                print(" X ", end='')
            elif (yi * n_cols + xi) == n_rows * n_cols - 1:
                print(" T ", end='')
            else:
                print(" %s " % actions[agent.get_best_action(yi * n_cols + xi)], end='')
        print()

In [None]:
print("Q-Learning")
draw_policy(env, agent_ql)

print("SARSA")
draw_policy(env, agent_sarsa)

### Submit to Coursera

In [None]:
EMAIL = ''
TOKEN = ''

In [None]:
from submit import submit_sarsa
submit_sarsa(rewards_ql, rewards_sarsa, EMAIL, TOKEN)

### More

Here are some of the things you can do if you feel like it:

* Play with epsilon. See how the learned policies change if you set epsilon to higher/lower values (e.g. 0.75).
* Implement Expected Value SARSA for a softmax policy:
$$ \pi(a_i|s) = softmax({Q(s,a_i) \over \tau}) = {e ^ {Q(s,a_i)/ \tau}  \over {\sum_{a_j}  e ^{Q(s,a_j) / \tau }}} $$
* Implement N-step algorithms and TD($\lambda$): see [Sutton's book](http://incompleteideas.net/book/bookdraft2018jan1.pdf) chapter 7 and chapter 12.
* Use those algorithms to train on CartPole in the previous / next assignment for this week.
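
As a starting point for the softmax variant listed above, here is a minimal sketch of the expected state value under a softmax policy. The function name `expected_value_softmax` and the default `tau` are assumptions made for illustration; wiring this into an agent's `get_value` (and sampling actions from the same distribution) is left as the exercise.

```python
import numpy as np

def expected_value_softmax(q_values, tau=1.0):
    """V(s) = sum_a softmax(Q(s, .) / tau)_a * Q(s, a) (hypothetical helper)."""
    q_values = np.asarray(q_values, dtype=float)
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the resulting probabilities
    logits = (q_values - q_values.max()) / tau
    pi = np.exp(logits)
    pi /= pi.sum()
    return float(pi @ q_values)

q = [1.0, 2.0, 3.0]  # toy Q-values
print(expected_value_softmax(q, tau=0.01))   # close to max(q): nearly greedy
print(expected_value_softmax(q, tau=100.0))  # close to mean(q): nearly uniform
```

The temperature $\tau$ plays a role similar to epsilon: small values make the policy almost greedy, large values make it almost uniform.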