# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

# Training the agent on the Game environment

In the last section, you defined the agent. In this section, you will define the training flow of the agent in the Game environment.

This notebook will cover:
1. Definition of episodes
2. Usage of epsilon value
3. Usage of experience replay buffer
4. Batch training of agent on sampled experiences

In this notebook, you will perform the following steps:

1. [Import modules](#modules)
2. [Read price data](#read)
3. [Define the training hyperparameters](#hyper)
4. [Training the agent](#run)

<a id='modules'></a> 
## Import modules

In the code below, we import the required modules. From the quantra_reinforcement_learning module, we import the Game and ExperienceReplay classes, the init_net function that initialises the agent, and the reward_exponential_pnl reward function. We saw these in the previous sections.

You can find the quantra_reinforcement_learning module in the '**Python Codes and Data**' unit in the last section of this course.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import pickle

# Appends new file paths to import modules
import sys
sys.path.append("..")

# To suppress GPU related warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

from data_modules.quantra_reinforcement_learning import reward_exponential_pnl
from data_modules.quantra_reinforcement_learning import Game
from data_modules.quantra_reinforcement_learning import init_net
from data_modules.quantra_reinforcement_learning import ExperienceReplay

<a id='read'></a> 
## Read price data

In [2]:
# The data is stored in the directory 'data_modules'
path = '../data_modules/'

bars5m = pd.read_pickle(path + 'PriceData5m.bz2')
bars5m.head(5)

| Time | open | high | low | close | volume |
|---|---|---|---|---|---|
| 2010-01-04 09:35:00-05:00 | 91.711 | 91.809 | 91.703 | 91.76 | 4448908.0 |
| 2010-01-04 09:40:00-05:00 | 91.752 | 91.973 | 91.752 | 91.932 | 4380988.0 |
| 2010-01-04 09:45:00-05:00 | 91.94 | 92.022 | 91.928 | 92.005 | 2876633.0 |
| 2010-01-04 09:50:00-05:00 | 92.005 | 92.177 | 91.973 | 92.177 | 4357079.0 |
| 2010-01-04 09:55:00-05:00 | 92.168 | 92.177 | 92.038 | 92.079 | 2955068.0 |


<a id='hyper'></a> 
## Define the training hyperparameters


Below are the main hyperparameters used by the ```run()``` function:

1. EPSILON: the base of the epsilon decay. The ```run()``` function computes epsilon as `EPSILON**log10(episode) + EPS_MIN`, so epsilon starts near 1 in the first episode and shrinks as the episodes progress.
2. MAX_MEM: the maximum length of the experience replay buffer.
3. BATCH_SIZE: the number of experiences sampled from the experience replay buffer at each training step.
4. LKBK: the number of bars used as lookback for training.
5. START_IDX: the index of the dataset at which the agent starts learning.
6. EPS_MIN: the minimum epsilon value.
7. DISCOUNT_RATE: the discount factor that weights the Q-value of the next state against the immediate reward.
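
To make the role of DISCOUNT_RATE concrete, here is a minimal sketch of the standard one-step Q-learning target. The helper `q_target` and the Q-values are made up for illustration, and the course's `ExperienceReplay.process()` method may compute its targets differently.

```python
import numpy as np

DISCOUNT_RATE = 0.99  # same value as rl_config['DISCOUNT_RATE']


def q_target(reward, next_q_values, game_over, discount=DISCOUNT_RATE):
    """One-step Q-learning target for a single experience (illustrative helper)."""
    if game_over:
        # No future state to bootstrap from once the trade has closed
        return reward
    # Immediate reward plus the discounted best Q-value of the next state
    return reward + discount * float(np.max(next_q_values))


# Hypothetical Q-values of the next state, one per action
next_q = np.array([0.10, 0.40, -0.20])

target_open = q_target(reward=0.05, next_q_values=next_q, game_over=False)
target_closed = q_target(reward=0.05, next_q_values=next_q, game_over=True)

print(target_open)    # 0.05 + 0.99 * 0.40
print(target_closed)  # reward only
```

A discount rate close to 1 makes the agent value future Q-values almost as much as the immediate reward, while a value close to 0 makes it short-sighted.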

In [3]:
rl_config = {

    # LEARNING_RATE: the learning rate used in the algorithm's optimizer
    'LEARNING_RATE': 0.05,

    # LOSS_FUNCTION: the loss function used in the algorithm
    'LOSS_FUNCTION': 'mse',

    # ACTIVATION_FUN: the activation function used in the neural network model
    'ACTIVATION_FUN': 'relu',

    # NUM_ACTIONS: the number of actions that the agent can take in each state
    'NUM_ACTIONS': 3,

    # HIDDEN_MULT: a multiplier used to determine the size of the hidden layer in the neural network model
    'HIDDEN_MULT': 2,

    # DISCOUNT_RATE: the discount rate used in the algorithm
    'DISCOUNT_RATE': 0.99,

    # LKBK: the number of previous time steps to consider in the algorithm
    'LKBK': 10,

    # BATCH_SIZE: the size of the mini-batch used in the algorithm
    'BATCH_SIZE': 1,

    # MAX_MEM: the maximum size of the memory used in the algorithm
    'MAX_MEM': 600,

    # EPSILON: the base of the epsilon decay used in run()
    'EPSILON': 0.01,

    # EPS_MIN: the minimum value of epsilon used
    'EPS_MIN': 0.001,

    # START_IDX: the starting index used in the algorithm
    'START_IDX': 3000,

    # RF: the reward function used in the algorithm
    'RF': reward_exponential_pnl,

    # TEST_MODE: a boolean that indicates whether the algorithm is in test mode or not. Set TEST_MODE to False when running in the local system
    'TEST_MODE': True,
    
    # PRELOAD: a boolean that indicates whether to preload the model from disk
    'PRELOAD': False,
    
    # UPDATE_QR: a boolean that indicates whether to update the Q-values in the algorithm
    'UPDATE_QR': True,
    
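    # WEIGHTS_FILE: path of the file where the model weights are saved
    'WEIGHTS_FILE': '../data_modules/indicator_model_fx_pair_0.h5',
    
    # TRADE_FILE: path of the file where the trade logs are saved
    'TRADE_FILE': '../data_modules/trade_logs_fx_pair_0.bz2',
    
    # REPLAY_FILE: path of the file where the experience replay buffer is saved
    'REPLAY_FILE': '../data_modules/memory_fx_pair_0.bz2',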
}

<a id='run'></a> 
## Training the agent

In the code below you define the ```run()``` function. This function helps to train the agent in the Game environment. 

The logical flow of the agent training is:

1. You first initialise the Game environment and the neural network agents. You also initialise the experience replay buffer.
2. Each iteration of exploring the environment is called an episode. In this notebook, one episode corresponds to one trade.
3. For each episode, you iterate step by step over the states generated by the underlying OHLC data.
4. You use a decaying function of the episode number to set a value called epsilon. With probability epsilon, the agent takes a random action to explore. Otherwise, it takes the action with the maximum Q-value.
5. Once the action is selected, take the action. This gives the next state and the reward from the environment.
6. Add the experience [current state, action, reward, next state] to the experience replay buffer.
7. Sample experiences from the experience replay buffer. Use these experiences to calculate the target Q-value of a given state and action.
8. The target is calculated using the two models, modelR and modelQ, as covered in previous sections. In the code, they are passed as `r_network` and `q_network`.
9. Use this target Q-value to train the agent and reduce the loss.
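
Steps 4, 6 and 7 can be sketched in a few lines. The `SimpleReplayBuffer` class and the `choose_action` function below are hypothetical stand-ins written for illustration. They are not the course's `ExperienceReplay` class, whose internals are not shown in this notebook, and the Q-values are made up.

```python
import random
from collections import deque

import numpy as np


class SimpleReplayBuffer:
    """Hypothetical stand-in for ExperienceReplay, not the course implementation."""

    def __init__(self, max_memory):
        # A deque with maxlen drops the oldest experience once it is full
        self.memory = deque(maxlen=max_memory)

    def remember(self, experience, game_over):
        # experience = [state_t, action, reward, state_tp1]
        self.memory.append((experience, game_over))

    def sample(self, batch_size):
        # Never ask for more experiences than are stored
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)


def choose_action(q_values, epsilon, num_actions, rng):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit."""
    if rng.random() <= epsilon:
        return int(rng.integers(0, num_actions))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
buffer = SimpleReplayBuffer(max_memory=3)

for step in range(5):
    q_values = np.array([0.1, 0.5, 0.2])  # made-up Q-values for 3 actions
    # epsilon=0.0 means the agent always exploits in this toy loop
    action = choose_action(q_values, epsilon=0.0, num_actions=3, rng=rng)
    buffer.remember([step, action, 0.0, step + 1], game_over=False)

print(len(buffer.memory))           # only the 3 most recent experiences are kept
print(buffer.sample(batch_size=2))  # a random mini-batch of 2 experiences
```

Sampling past experiences at random, rather than training only on the latest one, breaks the correlation between consecutive time steps.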

In [4]:
def run(bars5m, rl_config):
    """
    Function to run the RL model on the passed price data
    """
    
    pnls = []
    trade_logs = pd.DataFrame()
    episode = 0

    ohlcv_dict = {
        'open': 'first',
        'high': 'max',
        'low': 'min',
        'close': 'last',
        'volume': 'sum'
    }

    bars1h = bars5m.resample('1H', label='left', closed='right').agg(ohlcv_dict).dropna()
    bars1d = bars1h.resample('1D', label='left', closed='right').agg(ohlcv_dict).dropna()

    """---Initialise a NN and a set up initial game parameters---"""
    env = Game(bars5m, bars1d, bars1h, rl_config['RF'],
               lkbk=rl_config['LKBK'], init_idx=rl_config['START_IDX'])
    q_network, r_network = init_net(env, rl_config)
    exp_replay = ExperienceReplay(max_memory=rl_config['MAX_MEM'], discount=rl_config['DISCOUNT_RATE'])

    """---Preloading the model weights---"""
    if rl_config['PRELOAD']:
        q_network.load_weights(rl_config['WEIGHTS_FILE'])
        r_network.load_weights(rl_config['WEIGHTS_FILE'])
        # Use a context manager so that the file is closed after loading
        with open(rl_config['REPLAY_FILE'], 'rb') as replay_file:
            exp_replay.memory = pickle.load(replay_file)

    r_network.set_weights(q_network.get_weights())

    """---Loop that steps through one trade (game) at a time---"""
    while True:
        """---Stop the algo when end is near to avoid exception---"""
        if env.curr_idx >= len(bars5m)-1:
            break

        episode += 1

        """---Initialise a new game---"""
        env = Game(bars5m, bars1d, bars1h, rl_config['RF'],
                   lkbk=rl_config['LKBK'], init_idx=env.curr_idx)
        state_tp1 = env.get_state()

        """---Calculate epsilon for exploration vs exploitation random action generator---"""
        epsilon = rl_config['EPSILON']**(np.log10(episode))+rl_config['EPS_MIN']

        game_over = False
        cnt = 0

        """---Walk through time steps starting from the end of the last game---"""
        while not game_over:
        
            if env.curr_idx >= len(bars5m)-1:
                break

            cnt += 1
            state_t = state_tp1

            """---Generate a random action or through q_network---"""
            if np.random.rand() <= epsilon:
                action = np.random.randint(0, rl_config['NUM_ACTIONS'])

            else:
                q = q_network.predict(state_t)
                action = np.argmax(q[0])

            """---Updating the Game---"""
            reward, game_over = env.act(action)

            """---Updating trade/position logs---"""
            tl = [[env.curr_time, env.position, episode]]
            if game_over:
                tl = [[env.curr_time, 0, episode]]
            # DataFrame.append was removed in pandas 2.0, so use pd.concat
            trade_logs = pd.concat([trade_logs, pd.DataFrame(tl)])

            """---Move to next time step---"""
            env.curr_idx += 1
            state_tp1 = env.get_state()

            """---Adding state to memory---"""
            exp_replay.remember(
                [state_t, action, reward, state_tp1], game_over)

            """---Creating a new Q-Table---"""
            inputs, targets = exp_replay.process(
                q_network, r_network, batch_size=rl_config['BATCH_SIZE'])
            env.pnl_sum = sum(pnls)

            """---Update the NN model with a new Q-Table"""
            q_network.train_on_batch(inputs, targets)

            if game_over and rl_config['UPDATE_QR']:
                r_network.set_weights(q_network.get_weights())

        pnls.append(env.pnl)

        print("Trade {:03d} | pos {} | len {} | approx cum ret {:,.2f}% | trade ret {:,.2f}% | eps {:,.4f} | {} | {}".format(
            episode, env.position, env.trade_len, sum(pnls)*100, env.pnl*100, epsilon, env.curr_time, env.curr_idx))

        if not episode % 10:
            print('----saving weights, trade logs and replay buffer-----')
            r_network.save_weights(rl_config['WEIGHTS_FILE'], overwrite=True)
            trade_logs.to_pickle(rl_config['TRADE_FILE'])
            with open(rl_config['REPLAY_FILE'], 'wb') as replay_file:
                pickle.dump(exp_replay.memory, replay_file)

        if not episode % 7 and rl_config['TEST_MODE']:
            print('\n**********************************************\nTest mode is on due to resource constraints and therefore stopped after 7 trades. \nYou can trade on full dataset on your local computer and set TEST_MODE flag to False in rl_config dictionary. \nThe full code file, quantra_reinforcement_learning module and data file is available in last unit of the course.\n**********************************************\n')
            break

    if not rl_config['TEST_MODE']:
        print('----saving weights, trade logs and replay buffer-----')
        r_network.save_weights(rl_config['WEIGHTS_FILE'], overwrite=True)
        trade_logs.to_pickle(rl_config['TRADE_FILE'])
        with open(rl_config['REPLAY_FILE'], 'wb') as replay_file:
            pickle.dump(exp_replay.memory, replay_file)

    print('***FINISHED***')

In [5]:
# Call the run function and pass the dataframe and hyperparameters
run(bars5m, rl_config)

Trade 001 | pos 1 | len 2 | approx cum ret 0.04% | trade ret 0.04% | eps 1.0010 | 2010-03-01 12:50:00-05:00 | 3003
Trade 002 | pos -1 | len 1 | approx cum ret 0.09% | trade ret 0.05% | eps 0.2510 | 2010-03-01 13:00:00-05:00 | 3005
Trade 003 | pos 1 | len 15 | approx cum ret 0.22% | trade ret 0.13% | eps 0.1121 | 2010-03-01 14:20:00-05:00 | 3021
Trade 004 | pos 1 | len 14 | approx cum ret 0.12% | trade ret -0.10% | eps 0.0635 | 2010-03-01 15:35:00-05:00 | 3036
Trade 005 | pos 1 | len 2 | approx cum ret 0.10% | trade ret -0.02% | eps 0.0410 | 2010-03-01 15:50:00-05:00 | 3039
Trade 006 | pos 1 | len 105 | approx cum ret 1.11% | trade ret 1.01% | eps 0.0288 | 2010-03-03 11:40:00-05:00 | 3145
Trade 007 | pos 1 | len 131 | approx cum ret 1.52% | trade ret 0.41% | eps 0.0214 | 2010-03-05 09:40:00-05:00 | 3277

**********************************************
Test mode is on due to resource constraints and therefore stopped after 7 trades. 
You can trade on full dataset on your local computer an

As we can see above, the trade number, position, trade length, approximate cumulative return, trade return, epsilon, timestamp and bar index are printed for each episode.
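
The `eps` column in the output can be reproduced directly from the epsilon expression in ```run()```. The sketch below uses the EPSILON and EPS_MIN values from `rl_config`. The helper name `epsilon_for` is introduced here only for illustration.

```python
import numpy as np

EPSILON = 0.01   # rl_config['EPSILON']
EPS_MIN = 0.001  # rl_config['EPS_MIN']


def epsilon_for(episode):
    # Same expression as in run()
    return EPSILON ** np.log10(episode) + EPS_MIN


# Epsilon for the 7 episodes run in test mode, rounded like the printed output
schedule = [round(float(epsilon_for(ep)), 4) for ep in range(1, 8)]
print(schedule)
```

Since `0.01 ** log10(n)` equals `n ** -2`, epsilon falls with the square of the episode number. The agent explores almost always in the first episode and settles close to EPS_MIN after a few dozen episodes.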

Here we defined the run function in which the agent gathers new experiences in the Game environment and learns from them. In the coming units, we will run and evaluate this function for synthetic as well as real market data.