<a href="https://colab.research.google.com/github/450586509/reinforcement-learning-practice/blob/master/experience_replay.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Training with experience replay
- Play the game and sample a transition <s,a,r,s'>.

- Update q-values based on <s,a,r,s'>.

- Store the <s,a,r,s'> transition in a buffer.

- If the buffer is full, delete the earliest data.

- Sample K such transitions from that buffer and update q-values based on them.

To enable such training, we first need to implement a memory structure that acts as this buffer.
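
The steps listed above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the notebook's training code: the 5-state corridor environment, the `step`, `q_update` and `train` helpers, and all hyperparameters are hypothetical, and `collections.deque(maxlen=...)` stands in for the buffer because it drops its oldest item automatically when full.

```python
import random
from collections import defaultdict, deque

# Hypothetical toy environment: a corridor of 5 states, reward 1 at the right end.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, float(done), done


def q_update(q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update on a single <s,a,r,s'> transition."""
    target = r if done else r + gamma * max(q[(s_next, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def train(n_episodes=50, buffer_size=100, k=8, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    buffer = deque(maxlen=buffer_size)  # a full deque discards its oldest entry
    for _ in range(n_episodes):
        s, done = 0, False
        while not done:
            # Play: epsilon-greedy action with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: (q[(s, b)], rng.random()))
            s_next, r, done = step(s, a)
            # Update q-values on the fresh transition, then store it.
            q_update(q, s, a, r, s_next, done)
            buffer.append((s, a, r, s_next, done))
            # Sample up to K stored transitions and update on them as well.
            for transition in rng.sample(list(buffer), min(k, len(buffer))):
                q_update(q, *transition)
            s = s_next
    return q, buffer
```

Calling `q, buffer = train()` returns the learned table and the final buffer contents. Replaying stored transitions lets a reward observed once influence many later updates, which is the motivation for keeping the buffer at all.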


In [1]:
# Google Colab setup; comment out these two lines when running locally:
!wget https://bit.ly/2FMJP5K -q -O setup.py
!bash setup.py 2>&1 1>stdout.log | tee stderr.log

# Create a virtual display to draw game images on.
# If you are running locally, you can ignore this.
import os
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

%load_ext autoreload
%autoreload 2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import clear_output

--2019-09-16 15:27:06--  https://raw.githubusercontent.com/yandexdataschool/Practical_DL/fall18/xvfb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 640 [text/plain]
Saving to: ‘../xvfb’

     0K                                                       100%  111M=0s

2019-09-16 15:27:06 (111 MB/s) - ‘../xvfb’ saved [640/640]

Starting virtual X frame buffer: Xvfb.


In [0]:
import random
import queue


class ReplayBuffer(object):
    def __init__(self, size):
        """
        Create Replay buffer.
        Parameters
        ----------
        size: int
            Max number of transitions to store in the buffer. When the buffer
            overflows the old memories are dropped.

        Note: for this assignment you can pick any data structure you want.
              If you want to keep it simple, you can store a list of tuples of (s, a, r, s') in self._storage
              However you may find out there are faster and/or more memory-efficient ways to do so.
        """
        self._storage = queue.Queue()
        self._maxsize = size

        # OPTIONAL: YOUR CODE

    def __len__(self):
        # queue.Queue does not support len(); qsize() reports the item count.
        return self._storage.qsize()

    def add(self, obs_t, action, reward, obs_tp1, done):
        """
        Add a transition while keeping _storage within _maxsize.
        The FIFO rule applies: the oldest transition is removed first.
        """
        data = (obs_t, action, reward, obs_tp1, done)

        # Evict the oldest transition once the buffer is full, then store the new one.
        if self._storage.qsize() >= self._maxsize:
            self._storage.get()
        self._storage.put(data)


    def sample(self, batch_size):
        """Sample a batch of experiences.
        Parameters
        ----------
        batch_size: int
            How many transitions to sample.
        Returns
        -------
        obs_batch: np.array
            batch of observations
        act_batch: np.array
            batch of actions executed given obs_batch
        rew_batch: np.array
            rewards received as results of executing act_batch
        next_obs_batch: np.array
            next set of observations seen after executing act_batch
        done_mask: np.array
            done_mask[i] = 1 if executing act_batch[i] resulted in
            the end of an episode and 0 otherwise.
        """
        # Draw batch_size distinct transitions uniformly at random.
        # The queue's underlying deque is copied to a list for fast indexing.
        samples = random.sample(list(self._storage.queue), batch_size)

        # Collect each field of <s,a,r,s',done> across the sampled transitions.
        states = [t[0] for t in samples]
        actions = [t[1] for t in samples]
        rewards = [t[2] for t in samples]
        next_states = [t[3] for t in samples]
        is_dones = [t[4] for t in samples]

        return (
            np.array(states),
            np.array(actions),
            np.array(rewards),
            np.array(next_states),
            np.array(is_dones),
        )
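
The docstring note above mentions that faster or more memory-efficient storage is possible. One common option is a preallocated circular buffer of NumPy arrays, sketched below. This is an illustration under assumptions, not part of the assignment: the class name `ArrayReplayBuffer` and its `obs_shape` and `obs_dtype` parameters are hypothetical, actions are assumed to be integers, and sampling is done with replacement (unlike `random.sample`, which draws distinct items).

```python
import numpy as np


class ArrayReplayBuffer:
    """Hypothetical array-backed ring buffer with the same add/sample interface."""

    def __init__(self, size, obs_shape, obs_dtype=np.float32):
        self._maxsize = size
        # Preallocate one array per field so add() never allocates memory.
        self._obs = np.zeros((size, *obs_shape), dtype=obs_dtype)
        self._next_obs = np.zeros((size, *obs_shape), dtype=obs_dtype)
        self._actions = np.zeros(size, dtype=np.int64)  # assumes discrete actions
        self._rewards = np.zeros(size, dtype=np.float32)
        self._dones = np.zeros(size, dtype=bool)
        self._next_idx = 0  # slot the next add() writes to
        self._len = 0       # number of valid transitions stored

    def __len__(self):
        return self._len

    def add(self, obs_t, action, reward, obs_tp1, done):
        i = self._next_idx
        self._obs[i] = obs_t
        self._actions[i] = action
        self._rewards[i] = reward
        self._next_obs[i] = obs_tp1
        self._dones[i] = done
        # Wrap around, so the oldest slot is overwritten once the buffer is full.
        self._next_idx = (i + 1) % self._maxsize
        self._len = min(self._len + 1, self._maxsize)

    def sample(self, batch_size):
        # Uniform indices over the filled part of the buffer, with replacement.
        idx = np.random.randint(0, self._len, size=batch_size)
        return (
            self._obs[idx],
            self._actions[idx],
            self._rewards[idx],
            self._next_obs[idx],
            self._dones[idx],
        )
```

The trade-off is that the observation shape must be known up front, in exchange for constant-time `add` and batch sampling through a single fancy-indexing operation per field.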