# Experience replay $Q$-learning
If you have run the $Q$-learning algorithm on the cartpole example, you have seen that it needs many iterations (around 2885) to solve the problem, which is slow. A simple adjustment improves the method significantly: we build a history of experiences and draw random samples from that history for learning. Compared with plain $Q$-learning, the experience replay $Q$-learning algorithm has two additional components: a memory and a replay step.

The following sections define these components and show how they fit into the $Q$-learning method.

## 1. Memory
We build a memory to save data points $s,\:a,\:r,\:s^{\prime},\:done$ over time. Each data point contains $s$: `state`, $a$: `action`, $r$: `reward`, $s^{\prime}$: `next_state`, and $done$: a boolean that indicates whether the episode has ended. If the memory is full, the oldest data point is discarded and the new one is added.


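```
def remember(self, state, action, reward, next_state, done):
    # Store one transition as a 5-tuple.
    # Note: discarding the oldest item when full requires self.memory to be
    # a bounded container (for example collections.deque with maxlen set).
    self.memory.append((state, action, reward, next_state, done))
```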
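
The "discard the oldest data point" behaviour can be obtained with a bounded `collections.deque`. The following is a minimal sketch under that assumption; the class name `ReplayMemory` and the `capacity` parameter are illustrative and do not come from the accompanying code.

```python
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory; `capacity` is an assumed hyperparameter."""

    def __init__(self, capacity=2000):
        # A deque with maxlen drops its oldest item when a new one
        # is appended to a full deque.
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))


buffer = ReplayMemory(capacity=3)
for t in range(5):
    buffer.remember(t, 0, 1.0, t + 1, False)

# Only the three most recent transitions remain.
print([item[0] for item in buffer.memory])  # [2, 3, 4]
```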

## 2. Replay
When we want to train the network, instead of using only the data from the latest episode, we sample a batch from the memory. This gives us more diverse data to learn from, which helps the network learn better.

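```
import random

def replay(self, batch_size):
    # Sample without replacement; use the whole memory if it holds fewer
    # than batch_size items.
    batch = random.sample(self.memory, min(len(self.memory), batch_size))
    # Transpose the list of 5-tuples into five parallel lists, one per field.
    states, actions, rewards, new_states, dones = ([t[i] for t in batch] for i in range(5))
    # Run one training step on the sampled batch.
    loss = self.update_network(states, actions, rewards, new_states, dones)
    return loss
```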
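
The body of `update_network` is not shown here. In standard $Q$-learning, the regression target for a sampled transition is $r + \gamma \max_{a^{\prime}} Q(s^{\prime}, a^{\prime})$, with the bootstrap term dropped when $done$ is true. The sketch below computes these targets for a batch, assuming the network's predictions for the next states are already available; the helper name `q_learning_targets` and the discount factor `gamma` are assumptions made for illustration.

```python
def q_learning_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return one Q-learning target per sampled transition.

    next_q_values[i] holds Q(s'_i, a) for every action a (assumed to be
    the network's output for the i-th next state).
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # No bootstrapping from the next state once the episode has ended.
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(reward + bootstrap)
    return targets


targets = q_learning_targets(
    rewards=[1.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    gamma=0.9,
)
# First target bootstraps from max(0.5, 2.0); the second is the reward only.
print(targets)
```

A network-based `update_network` would then typically minimize the squared difference between these targets and the predicted $Q(s,a)$ for the sampled actions.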

## 3. Putting all together
Now, we put all the steps together to run the experience replay $Q$-learning algorithm.

First, we build a (deep) network to represent $Q(s,a)$ = `network(s)` and initialize an empty `memory = []`. Then, we iteratively improve the network. In each iteration of the algorithm, we do the following:
* i. We roll out the environment to collect data for experience replay $Q$-learning by following these steps:
    * i.a. We observe the `state` $s$ and select the `action` $a$.
    * i.b. We step the environment using $a$ and observe the `reward` $r$, the next state $s^{\prime}$, and the boolean $done$ (which is `True` if the episode has ended and `False` otherwise).
    * i.c. We add $s,\:a,\:r,\:s^{\prime},\:done$ to `memory`. See section 1.
    * i.d. We continue from i.a. until the episode ends.
* ii. We improve the $Q$ network:
    * ii.a. We sample a batch from `memory`. Let `states`, `actions`, `rewards`, `next_states`, `dones` denote the sampled batch. See section 2.
    * ii.b. We supply `states`, `actions`, `rewards`, `next_states`, `dones` to the network and optimize the parameters of the network. See section 2.
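
The steps above can be sketched end to end without any machine learning library. This is only an illustration under several assumptions: `ChainEnv` is a hypothetical toy environment (not cartpole), a $Q$-table stands in for the network, and the hyperparameters (`capacity`, `gamma`, `lr`, `epsilon`, `max_steps`) are arbitrary choices. The step cap `max_steps` is also an addition, used so that a rollout always terminates.

```python
import random
from collections import deque


class ChainEnv:
    """Hypothetical toy task: start at cell 0 and reach the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right, action 0 moves left (blocked at cell 0).
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == self.length - 1
        return self.pos, (1.0 if done else 0.0), done


class TabularReplayAgent:
    """A Q-table stands in for the network in this sketch."""

    def __init__(self, n_states, n_actions, capacity=1000,
                 gamma=0.9, lr=0.5, epsilon=0.2):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.memory = deque(maxlen=capacity)
        self.gamma, self.lr, self.epsilon = gamma, lr, epsilon

    def act(self, state):
        # Epsilon-greedy selection with random tie-breaking.
        if random.random() < self.epsilon:
            return random.randrange(len(self.q[state]))
        best = max(self.q[state])
        return random.choice([a for a, v in enumerate(self.q[state]) if v == best])

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def replay(self, batch_size):
        batch = random.sample(self.memory, min(len(self.memory), batch_size))
        for s, a, r, s_next, done in batch:
            target = r if done else r + self.gamma * max(self.q[s_next])
            # Move Q(s, a) a fraction lr of the way toward the target.
            self.q[s][a] += self.lr * (target - self.q[s][a])


def train(episodes=200, batch_size=32, max_steps=50, seed=0):
    random.seed(seed)
    env = ChainEnv()
    agent = TabularReplayAgent(n_states=env.length, n_actions=2)
    for _ in range(episodes):
        state = env.reset()
        for _ in range(max_steps):  # i. roll out one episode (capped)
            action = agent.act(state)  # i.a
            next_state, reward, done = env.step(action)  # i.b
            agent.remember(state, action, reward, next_state, done)  # i.c
            state = next_state
            if done:  # i.d
                break
        agent.replay(batch_size)  # ii. sample a batch and update Q
    return agent


agent = train()
# Greedy action in each non-terminal cell after training.
greedy = [max((0, 1), key=lambda a: agent.q[s][a]) for s in range(4)]
print(greedy)
```

After training, the greedy action in every non-terminal cell should be `1` (move right), since that is the shortest path to the rewarded cell.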

## 4. Related files
* [Experience replay $Q$ learning: The cartpole example (study and code)](replay_q_on_cartpole_notebook.ipynb)
* [Experience replay $Q$ learning: The cartpole example (code only)](./cartpole/replay_q_on_cartpole.py)