# Prioritised experience replay

I've already built a self-driving taxi using a reinforcement learning concept called uniform experience replay to aid the process of deep Q-learning.

After coming across the paper below, however, I've seen the amazing performance gains that prioritised experience replay, a more stratified way of sampling replay memory, can offer.

Check out the original [Google DeepMind 2016 PRIORITIZED EXPERIENCE REPLAY](https://arxiv.org/pdf/1511.05952v3.pdf) paper if you want a deeper understanding.

## So your taxi drives itself?

You might've gathered by now that this taxi is self-driving, but what does this mean?

If you're not familiar with what machine learning (ML) is, I recommend you go and check out [my blog post](https://medium.com/@ross.nkama/an-intuitive-understanding-of-machine-learning-6814add2b2a9) to get a high-level understanding.

So I've been getting my agent's neural network to use a technique called reinforcement learning to incrementally update its weights, based on how well it can correlate the state it's currently experiencing with (from experience) that state's tendency to bring about particular rewards and consequences at particular times.

These experiences I'm talking about are transitions in time where:

$\hspace{1cm} \text{transition} = [s, a, s', r]$

Which is to say that a transition is where an action $a$, taken in a state $s$, leads to the agent receiving a reward $r$ (positive or negative) on arriving at the next state $s'$.
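
As a minimal sketch, a transition could be held in a small Python container like the one below. The `Transition` class, its field names, and the integer encodings for states and actions are assumptions made for illustration, not taken from the taxi's actual code.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    """One experience [s, a, s', r]; the field names are illustrative."""
    state: int        # s: the state the agent was in
    action: int       # a: the action it took there
    next_state: int   # s': the state it ended up in
    reward: float     # r: the reward received on arrival (may be negative)


# Hypothetical example: from state 3, action 1 moves the taxi to state 7
# and costs it one unit of reward.
t = Transition(state=3, action=1, next_state=7, reward=-1.0)

# A NamedTuple unpacks in field order, matching [s, a, s', r].
s, a, s_next, r = t
```

Keeping the fields in the same order as $[s, a, s', r]$ means a stored transition can be unpacked directly into the symbols used in the rest of this post.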

Previously, in my first implementation of the taxi, I stored these transitions in a sliding window of memory and sampled experiences from it uniformly at random, as shown in our first algorithm, taken from [The Effects of Memory Replay in Reinforcement Learning - Stanford University](https://arxiv.org/pdf/1710.06574.pdf):

***
**_Algorithm 1_**: Uniform sampling from replay memory in a DQN
***


1: **INPUT**: minibatch size $M$, capacity $C$, learning rate $\alpha$, discount factor $\gamma$, initial weights $\theta_0$, policy $\pi$, total transitions $T$.  
2: Initialise replay memory BUFFER with capacity $C$.  
3: Observe initial state $s = s_0$  
4: for $t = 1$ to $T$ do  
5: $\hspace{0.9cm}$ select the action the policy currently considers best, $a = \pi(s)$  
6: $\hspace{0.9cm}$ take action $a$, observe $r$ and $s'$  
7: $\hspace{0.9cm}$ In memory BUFFER, store transition $[s, a, s', r]$, then set $s \leftarrow s'$  
8: $\hspace{0.9cm}$ From BUFFER, sample $M$ transitions $[s_i, a_i, s_i', r_i]$ uniformly at random  
9: $\hspace{0.9cm}$ for $i = 1$ to $M$ do  
10: $\hspace{1.6cm}$ Compute temporal difference error:  
$\hspace{2.8cm}$ $TD_{error} = r_i + \gamma \max_{a'} Q(s_i', a') - Q(s_i, a_i)$  
11: $\hspace{1.65cm}$ Update weights $\theta \leftarrow \theta + \alpha \, TD_{error} \, \nabla_{\theta} Q(s_i, a_i)$, where $Q$ is parameterised by $\theta$.  
12: $\hspace{0.85cm}$ end for  
13: end for


***
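
To make Algorithm 1 a bit more concrete, here is a hedged Python sketch of the sliding-window memory and one pass of uniform replay. It swaps the neural network for a plain lookup table of Q-values so that it runs with only the standard library; in that tabular case the gradient of $Q(s, a)$ with respect to its own table entry is 1, so the weight update collapses to `q[s][a] += alpha * td_error`. The names `UniformReplayBuffer`, `td_error` and `replay_step`, along with the toy transitions, are all hypothetical.

```python
import random
from collections import defaultdict, deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "s_next", "r"])


class UniformReplayBuffer:
    """Sliding window holding the most recent `capacity` transitions."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest item once the window is full
        self.memory = deque(maxlen=capacity)

    def store(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Every stored transition is equally likely to be drawn
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


def td_error(q, t, gamma):
    """r + gamma * max_a' Q(s', a') - Q(s, a) for one sampled transition."""
    return t.r + gamma * max(q[t.s_next]) - q[t.s][t.a]


def replay_step(q, buffer, batch_size, alpha, gamma):
    """Sample a minibatch uniformly and apply one TD update per transition."""
    for t in buffer.sample(batch_size):
        # Tabular stand-in for theta <- theta + alpha * TD_error * grad Q
        q[t.s][t.a] += alpha * td_error(q, t, gamma)


n_actions = 2  # assumed toy action space
q = defaultdict(lambda: [0.0] * n_actions)  # Q(s, a) starts at zero

buffer = UniformReplayBuffer(capacity=3)
for i in range(5):  # 5 transitions into a window of 3: the oldest 2 are evicted
    buffer.store(Transition(s=i, a=0, s_next=i + 1, r=1.0))

replay_step(q, buffer, batch_size=3, alpha=0.5, gamma=0.9)
```

Because every stored transition has the same chance of being drawn, a rare but informative experience gets replayed no more often than a routine one.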

However, after reading DeepMind's paper, it came to light that uniform experience replay was wasting a lot of the useful memories I'd stored, once I considered the facts that:
> "_An RL agent can learn more effectively from some transitions than from others._" - PRIORITIZED EXPERIENCE REPLAY, Google DeepMind, p. 1

and

> "_Prioritized replay further liberates agents from considering transitions with the same frequency that they are experienced._" - PRIORITIZED EXPERIENCE REPLAY, Google DeepMind, p. 1