# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

# Experience Replay

After defining the environment in the previous section, you will now learn how experience replay works and how the agent learns from these experiences.

This notebook will cover:
1. Definition of the memory or replay buffer
2. Processing of experiences to create arrays of input states and target Q-values for agent learning

In this notebook, you will perform the following steps:

1. [Import Modules](#modules)
2. [Code for replaying experiences](#exp)

<a id='modules'></a> 
## Import Modules

In the code below you import the modules.

In [1]:
import pandas as pd
import numpy as np

<a id='exp'></a> 
## Code for replaying experiences

The code for experience replay has three essential functions:

1. ```__init__()``` - which initialises the buffer and sets the maximum size of this buffer
2. ```remember()``` - which adds new experiences to the buffer and truncates older ones
3. ```process()``` - which returns the input states and target Q-values. This is so that we can update the Q-values for a given state-action pair while training the agent.
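The truncation behaviour of ```remember()``` can be sketched on its own. This is a minimal illustration that mirrors the append-then-drop-oldest logic; the free function `remember`, the buffer size of 3 and the placeholder string states are assumptions made for the example, not part of the course code.

```python
# Illustrative buffer with a deliberately small capacity (assumed value)
max_memory = 3
memory = []


def remember(memory, experience, game_over, max_memory):
    """Append one experience and drop the oldest entry if the buffer overflows."""
    memory.append([experience, game_over])
    if len(memory) > max_memory:
        del memory[0]


# Add five placeholder experiences: (state_t, action_t, reward_t, state_tp1)
for step in range(5):
    experience = (f"s{step}", 0, 0.0, f"s{step + 1}")
    remember(memory, experience, False, max_memory)

# Only the three most recent experiences remain in the buffer
oldest_state = memory[0][0][0]
newest_next_state = memory[-1][0][3]
print(len(memory), oldest_state, newest_next_state)
```

Because the oldest entry is removed first, the buffer always holds the most recent experiences up to its maximum size.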

The ```process()``` function is the most important of the three. The flow of the ```process()``` function is as follows:

1. You randomly select experiences from the memory buffer. This returns many entries with the S.A.R.S. structure: state, action, reward and next state.
2. For each such experience, you first take the state at time t. This state is used to get the Q-values from model R. These Q-values from model R are stored in a target vector. They tell you the current importance of each action in this state.
3. Thereafter, you get the Q-value of the best action using the model Q for the state at time t+1. The state at t+1 is the next state.
4. This Q-value for the next state is discounted first. It is then added to the reward earned for the transition from the state at t to the state at t+1.
5. This summed value is then set as the Q-value for the action taken in the state at t in the target vector. This is the only action for which the Q-value is replaced. This is because we want to train the agent to learn the importance of this action based on the value of the next state. If the game ended with this transition, there is no next state to value, so the Q-value for this action is set to the reward alone.
6. All such state (state at t) and target vectors are returned for all sampled experiences.
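The steps above can be traced for a single experience with a few lines of NumPy. This is only a sketch: the Q-value arrays are made-up numbers standing in for the predictions of model R and model Q, and the action and reward are likewise assumed for illustration.

```python
import numpy as np

DISCOUNT_RATE = 0.99

# Assumed Q-values, standing in for model predictions (one value per action)
q_values_state_t = np.array([0.2, 0.5, 0.1])    # model R on the state at t
q_values_state_tp1 = np.array([0.3, 0.8, 0.4])  # model Q on the state at t+1

# Assumed experience: action 2 was taken at t and earned a reward of 1.0
action_t = 2
reward_t = 1.0

# Step 2: start from the current Q-values of model R
target = q_values_state_t.copy()

# Steps 3 to 5: discount the best next-state Q-value, add the reward,
# and overwrite only the entry of the action that was taken
target[action_t] = reward_t + DISCOUNT_RATE * np.max(q_values_state_tp1)

# If the game had ended on this transition, only the reward would be used
target_game_over = q_values_state_t.copy()
target_game_over[action_t] = reward_t

print(target)
print(target_game_over)
```

Only the entry for the action taken changes; the other two entries keep the values predicted by model R, so they contribute no error when the model is trained on this target.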

In [2]:
# The rate at which Q-values of subsequent states are discounted
DISCOUNT_RATE = 0.99

In [3]:
class ExperienceReplay(object):
    '''This class builds the Q-value targets used for training.
    It gathers memory from previous experience and 
    creates target Q-values for each action in each
    sampled state using the NN. At the end of the game the
    target for the action taken is the reward alone.
    The weights in the NN are constantly updated with each new
    batch of experience. 
    This is the heart of the RL algorithm.
    Names used in this class:
        state_tp1: state at time t+1
        state_t: state at time t
        action_t: int {0..2} hold, sell, buy taken at state_t 
        Q_sa: float, maximum Q-value predicted for state_tp1
        reward_t: reward for state_t
        self.memory: list of [[state_t, action_t, reward_t, state_tp1], game_over] entries
        inputs: an array with scrambled states at different times
        targets: Nx3 array of weights for each action for scrambled input states
    '''
    def __init__(self, max_memory=1000, discount=DISCOUNT_RATE):
        # Set the maximum length of the memory buffer
        self.max_memory = max_memory
        # Initialise the memory as a list
        self.memory = list()
        # Set the discount rate applied to the Q-value of the next state
        self.discount = discount

    def remember(self, states, game_over):
        # Add the experience (state at t, action, reward, state at t+1)
        # and the game-over flag to memory
        self.memory.append([states, game_over])
        # If entries added are more than max_memory
        # truncate the first entry
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def process(self, modelQ, modelR, batch_size=10):
        # Get the length of the memory filled in the buffer
        len_memory = len(self.memory)
        # Get the number of actions the agent can take
        num_actions = modelQ.output_shape[-1]
        # Get the shape of state
        env_dim = self.memory[0][0][0].shape[1]
        
        # Initialise input and target arrays
        inputs = np.zeros((min(len_memory, batch_size), env_dim))
        targets = np.zeros((inputs.shape[0], num_actions))
        
        # Step randomly through different places in the memory
        # and scramble them into a new input array (inputs) with the
        # length of the pre-defined batch size
                    
        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):    
            # Obtain the parameters for Bellman from memory,
            # S.A.R.S: state, action, reward, new state
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]
            # Boolean flag to check if the game is over
            game_over = self.memory[idx][1]
            inputs[i] = state_t    
            
            # Calculate the targets for the state at time t
            targets[i] = modelR.predict(state_t)[0]
            
            # Get the maximum Q-value over all actions for the state at time t+1
            Q_sa = np.max(modelQ.predict(state_tp1)[0])
           
            if game_over:
                # When game is over we have a definite reward
                targets[i, action_t] = reward_t
            else:
                # Update the part of the target for which action_t occurred to the new value
                # Q_new(s,a) = reward_t + gamma * max_a' Q(s', a')
                
                targets[i, action_t] = reward_t + self.discount * Q_sa
        
        return inputs, targets

In [5]:
# Creating an instance of the experience replay class
ER = ExperienceReplay()
ER

<__main__.ExperienceReplay at 0x20f46e5f1c0>

Once the memory buffer has been filled, we use the ```process()``` function to generate pairs of input states and target Q-values. These are used in batches to update the models Q and R. We will use these returned values in functions from subsequent sections.
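To give a feel for how one such batch might be consumed, here is a rough sketch of a single update step. Everything in it is an assumption for illustration: `TinyLinearModel` is a hypothetical NumPy stand-in for the course's neural networks, its `train_on_batch` method only imitates the idea of a batch update, and the random arrays stand in for the `inputs` and `targets` that ```process()``` would return.

```python
import numpy as np


class TinyLinearModel:
    """Hypothetical stand-in for a neural network (not the course's model)."""

    def __init__(self, n_inputs, n_actions, learning_rate=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(scale=0.1, size=(n_inputs, n_actions))
        self.learning_rate = learning_rate

    def predict(self, states):
        # One Q-value per action for every state in the batch
        return states @ self.weights

    def train_on_batch(self, states, targets):
        # One gradient-descent step on the mean squared error;
        # returns the loss measured before the step
        error = self.predict(states) - targets
        loss = float(np.mean(error ** 2))
        gradient = 2 * states.T @ error / error.size
        self.weights -= self.learning_rate * gradient
        return loss


# Random placeholders shaped like a batch of 10 states with 4 features
# and 3 actions (assumed sizes)
rng = np.random.default_rng(1)
inputs = rng.normal(size=(10, 4))
targets = rng.normal(size=(10, 3))

model = TinyLinearModel(n_inputs=4, n_actions=3)
loss_before = model.train_on_batch(inputs, targets)
loss_after = float(np.mean((model.predict(inputs) - targets) ** 2))
print(loss_before, loss_after)
```

The step moves the model's predicted Q-values closer to the targets for this batch, which is the role the returned arrays play in training.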
<br></br>