### Honor Track: experience replay
_This notebook builds upon `qlearning.ipynb`, or to be exact, on the `qlearning.py` file generated from it._

There's a powerful technique that you can use to improve sample efficiency for off-policy algorithms: [spoiler] Experience replay :)

The trick is that you can train Q-learning and EV-SARSA on `<s,a,r,s'>` tuples even if they weren't sampled under the agent's current policy. So here's what we're gonna do:

<img src=https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/exp_replay.png width=480>
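
To see why old transitions stay usable, it helps to write out the tabular Q-learning update. This assumes `QLearningAgent` from `qlearning.py` implements the standard form (its code is not shown in this notebook), with $\alpha$ as the learning rate and $\gamma$ as the discount:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a')\right)$$

The right-hand side needs only the stored tuple `<s,a,r,s'>` and the current Q-table. It never asks which policy picked `a`, so a transition recorded many episodes ago is still a valid training sample. EV-SARSA replaces the $\max_{a'}$ with an expectation over the agent's current policy in $s'$, which likewise does not depend on how the transition was collected.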

#### Training with experience replay
1. Play game, sample `<s,a,r,s'>`.
2. Update q-values based on `<s,a,r,s'>`.
3. Store `<s,a,r,s'>` transition in a buffer. 
 1. If buffer is full, delete earliest data.
4. Sample K such transitions from that buffer and update q-values based on them.
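
The four steps above can be sketched on a toy problem before touching the real assignment code. Everything in this block is an illustrative assumption rather than part of the notebook: the 5-state corridor `toy_step`, the dictionary Q-table, the random behaviour policy and the hyperparameters. The buffer here is a `deque` with `maxlen`, which drops the oldest item automatically.

```python
import random
from collections import defaultdict, deque

random.seed(0)

# Hypothetical 5-state corridor: action 1 moves right, action 0 moves left.
# Reaching state 4 gives reward 1 and ends the episode.
def toy_step(s, a):
    next_s = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = next_s == 4
    return next_s, float(done), done

q = defaultdict(float)        # Q-table keyed by (state, action)
buffer = deque(maxlen=100)    # step 3.1: the oldest transition is dropped when full
alpha, gamma, K = 0.5, 0.9, 8

def q_update(s, a, r, next_s, done):
    # Tabular Q-learning update; terminal states are not bootstrapped from.
    target = r if done else r + gamma * max(q[next_s, 0], q[next_s, 1])
    q[s, a] += alpha * (target - q[s, a])

for episode in range(50):
    s, done = 0, False
    while not done:
        a = random.randrange(2)                     # step 1: play (random behaviour policy)
        next_s, r, done = toy_step(s, a)
        q_update(s, a, r, next_s, done)             # step 2: update on the fresh transition
        buffer.append((s, a, r, next_s, done))      # step 3: store it in the buffer
        for transition in random.choices(buffer, k=K):  # step 4: replay K stored transitions
            q_update(*transition)
        s = next_s
```

Note that the behaviour policy here is purely random, yet the Q-table still learns to prefer moving right. That is the off-policy property experience replay relies on.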


To enable such training, first we must implement a memory structure that would act like such a buffer.

In [None]:
%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

#XVFB will be launched if you run on a server
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    %env DISPLAY=:1

In [None]:
from collections import deque

class ReplayBuffer(object):
    def __init__(self, size):
        """
        Create Replay buffer.

        Notes
        -----
        For this assignment you can pick any data structure you want.
        If you want to keep it simple, you can store a list of tuples of (s, a, r, s') in self._storage
        However you may find out there are faster and/or more memory-efficient ways to do so.        
        
        Parameters
        ----------
        size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows the old memories are dropped.
        """
        self._storage = deque()
        self._maxsize = size
        
    def __len__(self):
        return len(self._storage)

    def add(self, obs_t, action, reward, obs_tp1, done):
        """
        Add replay to the buffer by adding <s, a, r, s', done>
        
        Notes
        -----
        self._storage never exceeds self._maxsize.
        The FIFO rule is followed: the oldest examples are removed first.
        
        Parameters
        ----------
        obs_t : object
            The observation (state) at time t
        action : object
            The action at time t
        reward : object
            The reward gained from taking the action
        obs_tp1 : object
            The observation (state) at time t + 1 (after taking the action)
        done : object
            Whether or not the game is done
        """
        
        data = (obs_t, action, reward, obs_tp1, done)
        
        # add data to storage
        self._storage.append(data)
        
        if len(self._storage) > self._maxsize:
            # Drop the oldest transition (FIFO)
            self._storage.popleft()
        
    def sample(self, batch_size):
        """
        Sample a batch of experiences.
        
        Parameters
        ----------
        batch_size: int
            How many transitions to sample.
            
        Returns
        -------
        obs_batch: np.array
            Batch of observations
        act_batch: np.array
            Batch of actions executed given obs_batch
        rew_batch: np.array
            Rewards received as results of executing act_batch
        next_obs_batch: np.array
            Next set of observations seen after executing act_batch
        done_mask: np.array
            done_mask[i] = 1 if executing act_batch[i] resulted in
            the end of an episode and 0 otherwise.
        """
        
        indices = np.random.randint(low=0, high=len(self), size=batch_size)
        
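        # Integer dtypes assume discrete states and actions (as in Taxi)
        obs_batch = np.empty(batch_size, dtype=int)
        act_batch = np.empty(batch_size, dtype=int)
        # Rewards may be non-integer, so keep them as floats
        rew_batch = np.empty(batch_size, dtype=float)
        next_obs_batch = np.empty(batch_size, dtype=int)
        done_mask = np.empty(batch_size, dtype=int)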
        
        # Collect <s, a, r, s' ,done>
        for batch_nr, i in enumerate(indices):
            obs_batch[batch_nr] = self._storage[i][0]
            act_batch[batch_nr] = self._storage[i][1]
            rew_batch[batch_nr] = self._storage[i][2]
            next_obs_batch[batch_nr] = self._storage[i][3]
            done_mask[batch_nr] = self._storage[i][4]
                
        return obs_batch, act_batch, rew_batch, next_obs_batch, done_mask
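
The docstring above hints that there are faster and more memory-efficient layouts than a container of tuples. One possible sketch, not the reference solution, is a ring buffer over preallocated NumPy arrays. The class name `ArrayReplayBuffer` and its field names are made up for illustration, and it assumes integer states and actions (as in Taxi) and float rewards.

```python
import numpy as np

class ArrayReplayBuffer:
    """Hypothetical array-backed replay buffer with the same interface."""

    def __init__(self, size):
        self._maxsize = size
        self._obs = np.empty(size, dtype=np.int64)
        self._act = np.empty(size, dtype=np.int64)
        self._rew = np.empty(size, dtype=np.float64)
        self._next_obs = np.empty(size, dtype=np.int64)
        self._done = np.empty(size, dtype=np.int64)
        self._next_idx = 0   # slot the next add() will write to
        self._len = 0        # number of valid transitions stored so far

    def __len__(self):
        return self._len

    def add(self, obs_t, action, reward, obs_tp1, done):
        i = self._next_idx
        self._obs[i], self._act[i], self._rew[i] = obs_t, action, reward
        self._next_obs[i], self._done[i] = obs_tp1, done
        # Advance the write pointer; wrapping around overwrites the oldest entry (FIFO).
        self._next_idx = (i + 1) % self._maxsize
        self._len = min(self._len + 1, self._maxsize)

    def sample(self, batch_size):
        # Sample indices with replacement, then gather each field by fancy indexing.
        idx = np.random.randint(0, self._len, size=batch_size)
        return (self._obs[idx], self._act[idx], self._rew[idx],
                self._next_obs[idx], self._done[idx])
```

Sampling here is a handful of vectorized index operations instead of a Python loop over a `deque`, whose random access gets slower toward the middle. The trade-off is that the dtypes are fixed up front, so this layout only fits environments whose observations match them.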

Some tests to make sure your buffer works right

In [None]:
replay = ReplayBuffer(2)
obj1 = tuple(range(5))
obj2 = tuple(range(5, 10))
replay.add(*obj1)
assert replay.sample(1)==obj1, "If there's just one object in the buffer, it must be retrieved by replay.sample(1)"
replay.add(*obj2)
assert len(replay)==2, "Please make sure the __len__ method works as intended."
replay.add(*obj2)
assert len(replay._storage)==2, "When buffer is at max capacity, replace objects instead of adding new ones."
assert tuple(np.unique(a) for a in replay.sample(100))==obj2
replay.add(*obj1)
assert max(len(np.unique(a)) for a in replay.sample(100))==2
replay.add(*obj1)
assert tuple(np.unique(a) for a in replay.sample(100))==obj1
print ("Success!")

Now let's use this buffer to improve training:

In [None]:
import gym
from qlearning import QLearningAgent

env = gym.make("Taxi-v2")
n_actions = env.action_space.n

In [None]:
def play_and_train_with_replay(env,
                               agent,
                               replay=None, 
                               t_max=10**4,
                               replay_batch_size=32):
    """
    Run a full game with actions chosen by the agent's policy,
    updating the agent after every step.
    
    Parameters
    ----------
    env : gym-object
        The environment to play with
    agent : QLearningAgent
        The agent to play and train with
    replay : None or ReplayBuffer
        The replay buffer to use
        No experience replay will be used if replay is None
    t_max : int
        The maximum number of steps to take
    replay_batch_size : int
        The number to sample from the replay buffer (if replay buffer is not None)
    
    Returns
    -------
    total_reward : float
        The accumulated reward
    """
    total_reward = 0.0
    s = env.reset()
    
    for t in range(t_max):
        # Get agent to pick action given state s
        a = agent.get_action(s)
        
        next_s, r, done, _ = env.step(a)

        # Update agent on current transition. Use agent.update
        agent.update(s, a, r, next_s)

        if replay is not None:
            # Store current <s,a,r,s'> transition in buffer
            replay.add(s, a, r, next_s, done)
            
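            # Sample replay_batch_size random transitions from replay,
            # then update agent on each of them in a loop.
            # Separate names keep the current s, a, r from being overwritten,
            # so total_reward below still adds the reward of the real step.
            s_batch, a_batch, r_batch, next_s_batch, _ = replay.sample(replay_batch_size)
            for s_i, a_i, r_i, next_s_i in zip(s_batch, a_batch, r_batch, next_s_batch):
                agent.update(s_i, a_i, r_i, next_s_i)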
                    
        s = next_s
        total_reward +=r
        if done:
            break
    
    return total_reward


In [None]:
# Create two agents: first will use experience replay, second will not.
agent_baseline = QLearningAgent(alpha=0.5, epsilon=0.25, discount=0.99,
                       get_legal_actions = lambda s: range(n_actions))

agent_replay = QLearningAgent(alpha=0.5, epsilon=0.25, discount=0.99,
                       get_legal_actions = lambda s: range(n_actions))

replay = ReplayBuffer(1000)

In [None]:
from IPython.display import clear_output
from pandas import DataFrame
moving_average = lambda x, span=100: DataFrame({'x':np.asarray(x)}).x.ewm(span=span).mean().values

rewards_replay, rewards_baseline = [], []

for i in range(1000):
    rewards_replay.append(play_and_train_with_replay(env, agent_replay, replay))
    rewards_baseline.append(play_and_train_with_replay(env, agent_baseline, replay=None))
    
    agent_replay.epsilon *= 0.99
    agent_baseline.epsilon *= 0.99
    
    if i % 100 == 0:
        clear_output(True)
        print('Baseline : eps =', agent_baseline.epsilon, 'mean reward =', np.mean(rewards_baseline[-10:]))
        print('ExpReplay: eps =', agent_replay.epsilon, 'mean reward =', np.mean(rewards_replay[-10:]))
        plt.plot(moving_average(rewards_replay), label='exp. replay')
        plt.plot(moving_average(rewards_baseline), label='baseline')
        plt.grid()
        plt.legend()
        plt.show()

### Submit to Coursera

In [None]:
EMAIL = ''
TOKEN = ''

In [None]:
from submit import submit_experience_replay
submit_experience_replay(rewards_replay, rewards_baseline, EMAIL, TOKEN)

#### What to expect:

Experience replay, if implemented correctly, should noticeably speed up the algorithm's initial convergence, but it shouldn't affect the final performance.

### Outro

We will use the code you just wrote extensively in the next week of our course. If you feel you need more examples to understand how experience replay works, try using it for binarized state spaces (CartPole or other __[classic control envs](https://gym.openai.com/envs/#classic_control)__).
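
To try the tabular agent and the replay buffer on an environment with continuous observations, each observation first has to be turned into a discrete, hashable key. Below is a minimal sketch of one way to do that with uniform bins. The `discretize` helper, the bin count and the bounds are assumptions chosen for illustration, not values taken from the course or from gym.

```python
import numpy as np

def discretize(obs, low, high, n_bins=10):
    """Map a continuous observation vector to a hashable tuple of bin indices."""
    obs = np.clip(obs, low, high)
    # Scale each dimension to [0, 1], then to an integer bin in [0, n_bins - 1].
    ratios = (obs - low) / (high - low)
    bins = np.minimum((ratios * n_bins).astype(int), n_bins - 1)
    return tuple(bins.tolist())

# Hypothetical clipping bounds for a 4-dimensional CartPole-like observation.
low = np.array([-2.4, -3.0, -0.21, -3.0])
high = np.array([2.4, 3.0, 0.21, 3.0])

state = discretize(np.array([0.0, 0.5, -0.1, 4.0]), low, high)
```

The resulting tuple can be used as a dictionary key in a Q-table. A buffer that stores observations in fixed integer arrays would need the tuple encoded as a single integer first, or a storage layout that keeps arbitrary Python objects.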

__Next week__ we're gonna explore how q-learning and similar algorithms can be applied to large state spaces, with deep learning models to approximate the Q function.

However, __the code you've written__ for this week is already capable of solving many RL problems, and as an added benefit, it is very easy to detach. You can use Q-learning, SARSA and Experience Replay for any RL problems you want to solve: just throw 'em into a file and import the stuff you need.