In [None]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
config = tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Environment Implementation Tutorial

By the end of this notebook you will know how to implement your own environment to solve your own problem. For this example, we have selected a real-world dataset with historical stock data from Apple (https://www.kaggle.com/tarunpaparaju/apple-aapl-historical-stock-data), and we want to train a trading agent.

In [None]:
from environments.env_base import EnvInterface, ActionSpaceInterface
from RL_Problem import rl_problem
from RL_Agent import ppo_agent_discrete_parallel
from RL_Agent.base.utils.networks import networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Flatten
from RL_Agent.base.utils import history_utils
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
import gc

## Trading Environment

The environment is an entity that implements a problem of interest in a way that is compatible with reinforcement learning: as a decision-making problem and, more specifically, as a Markov Decision Process.

Our particular problem has three possible actions: 1) buy stock, 2) sell stock, and 3) idle (do nothing). The agent has to choose the best action with the objective of maximizing the profit over a sequence of 100 days. Each day the agent can take one action, which is executed just before market close. To simplify the problem, the agent can hold only one unit of stock at each time step, which means that it has to sell the stock before buying again.
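The one-unit rule can be sketched as a small helper that maps the requested action to the action that is actually executed. The name `effective_action` and the module-level constants are hypothetical, chosen only for illustration; the environment defined later in this notebook applies the same rule inside `step`.

```python
# Hypothetical action codes, in the same order as the action space used in this notebook.
BUY, SELL, IDLE = 0, 1, 2


def effective_action(action, holding_stock):
    """Return the action that is really executed under the one-unit rule."""
    if action == BUY and holding_stock:
        return IDLE  # The agent already holds stock, so it cannot buy again.
    if action == SELL and not holding_stock:
        return IDLE  # The agent holds nothing, so there is nothing to sell.
    return action


print(effective_action(BUY, holding_stock=True))
print(effective_action(SELL, holding_stock=True))
```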

The state is formed by three elements: 

* Price variation: the variation of the stock price between two consecutive days, expressed as a multiple of the previous price. It is calculated as "price_variation = price(day)/price(day-1)", where "day" represents the current day.
* Gains: Potential profit if the stock is sold. If the agent holds no stock, "gains = 0". If the agent holds stock, "gains = price(day)/price(buying_day)", where "day" represents the current day.
* Stock Bought: A flag that is set to "0" when the agent has no stock in its possession and to "1" when it does.
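As a rough illustration of the three elements above, the state can be assembled from the current price, the previous price and the buying price. The helper name `build_state` is hypothetical, and a buying price of 0 is assumed to mean "no stock held", which is the convention the environment below uses.

```python
import numpy as np


def build_state(price, last_price, buying_price):
    """Build [price_variation, gains, stock_bought] for one day."""
    holding = buying_price > 0.0  # Assumption: 0 means the agent holds no stock.
    price_variation = price / last_price
    gains = price / buying_price if holding else 0.0
    return np.array([price_variation, gains, 1.0 if holding else 0.0])


# Stock bought at 88, yesterday's price was 100, today's price is 110.
state = build_state(price=110.0, last_price=100.0, buying_price=88.0)
print(state)
```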

The reward function returns higher values when the agent sells stock at a good profit and lower values when it sells at a loss. Otherwise the reward is neutral. Finally, if the episode terminates with stock in possession, the reward is computed relative to the money invested to buy that stock.

Specifically, we use the sell price relative to the buy price, "(sell price/buy price) - 1", as the reward: "reward > 0" when the agent makes a profit and "reward < 0" when it takes a loss. In all other cases, "reward = 0".
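The reward formula is small enough to check by hand. The sketch below uses a hypothetical helper name, `sell_reward`, and made-up prices; it only restates "(sell price/buy price) - 1".

```python
def sell_reward(sell_price, buy_price):
    """Relative return of a sale: positive for a profit, negative for a loss."""
    return sell_price / buy_price - 1.0


print(sell_reward(110.0, 100.0))  # Sold above the buying price.
print(sell_reward(90.0, 100.0))   # Sold below the buying price.
```

Because the reward is relative to the buying price, a gain of 10 on a stock bought at 100 is rewarded the same as a gain of 20 on a stock bought at 200.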

Before creating the environment, we need to unzip the dataset.

In [None]:
!unzip tutorials/data/HistoricalQuotes.zip -d tutorials/data/HistoricalQuotes.csv

In the next cell, we define the action space by extending "ActionSpaceInterface" from "environments.env_base.py". The action space requires setting the number of actions, "self.n". If the action space were continuous, we would also have to define the action bounds in two properties: 1) "self.low", the lower bound, and 2) "self.high", the upper bound. Both bounds are floats and give the overall minimum and maximum over all possible actions.

In [None]:
class action_space(ActionSpaceInterface):
    def __init__(self):
        # Number of actions
        self.n = 3

        # buy = buy stock, sell = sell stock and idle = do nothing.
        self.actions = {'buy': 0,
                        'sell': 1,
                        'idle': 2}
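
For comparison, a continuous action space might look like the sketch below. It is not used in this tutorial. The base class here is a local stand-in so that the snippet runs on its own (in the notebook you would extend the imported "ActionSpaceInterface"), and the class name, the single action and the [-1, 1] bounds are assumptions made for illustration.

```python
class ActionSpaceInterface:
    """Local stand-in for environments.env_base.ActionSpaceInterface (assumption)."""
    pass


class continuous_action_space(ActionSpaceInterface):
    def __init__(self):
        # Number of actions (hypothetical: a single continuous action).
        self.n = 1

        # Overall lower and upper bounds shared by all actions (hypothetical values).
        self.low = -1.0
        self.high = 1.0


space = continuous_action_space()
print(space.n, space.low, space.high)
```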

The next cell implements the environment itself by extending "EnvInterface" from "environments.env_base.py". 
We are required to define the following functions, which comply with the OpenAI Gym interface:

* reset(): Resets the environment to an initial state. Receives nothing and returns the state (ndarray).
* step(action): Executes an action in the environment, producing a transition from the current state to the next state. Receives an action (ndarray) and returns the next state (ndarray), the reward (float), done (boolean, True if a terminal state is reached) and additional information (dict).
* render(): Renders the environment graphically or via the command line. Receives nothing and returns nothing.
* close(): Closes the rendering window.

In [None]:
class StockTrading(EnvInterface):
    """
    Environment for stock trading using historical stock data from Apple.
    https://www.kaggle.com/tarunpaparaju/apple-aapl-historical-stock-data
    """

    def __init__(self):
        super().__init__()

        # Define the actions and observation spaces. They are required properties.
        self.action_space = action_space()
        
        # The observation space is used by the library to know the state shape. In this case, we define a dummy 
        # (array of zeros) because only its shape is needed, not its values.
        self.observation_space = np.zeros(3, )

        # Maximum iterations per episode.
        self.max_iter = 99

        # Load data
        dataset = pd.read_csv("tutorials/data/HistoricalQuotes.csv")

        # Reverse the data order to get the correct chronological order.
        dataset = dataset.iloc[::-1]
        print(dataset.head())

        # Preprocess data
        self.data = dataset.iloc[:, 1].values
        self.data = [float(d.split()[0][1:]) for d in self.data]

        # Show dataset data
        fig, ax = plt.subplots(1)
        ax.plot(range(len(self.data)), self.data)
        ax.set(xlabel='days', ylabel='price $',
               title='Stock Trading Data')
        ax.grid()
        plt.show()


        # Auxiliary environment variables
        self.index_day = None  # Current day.
        self.profit = None  # Current profit
        self.stock_buying_price = 0.  # Buying price of the held stock; 0 means the agent holds no stock.

        random.seed()
        
        # Buffer of days of buying and selling for rendering purposes.
        self.render_buy_index = []
        self.render_sell_index = []

    def reset(self):
        """
        Reset the environment to an initial state.
        :return: observation/state. numpy array of state shape
        """
        gc.collect()
        
        # Select a random init day.
        self.index_day = random.randint(11, len(self.data)-101)
        self.init_index = self.index_day

        price = self.data[self.index_day]

        # Create the initial state. [price_variation, gains, stock bought].
        state = np.array([1., 1., 0.])

        # Initialize control variables and buffers.
        self.last_action = self.action_space.actions['idle']
        self.last_state = state
        self.last_price = price
        self.first_price = price
        self.last_reward = 0.

        # Initialize auxiliary environment variables
        self.iter = 0
        self.profit = 0.
        self.index_day += 1
        self.stock_buying_price = 0.

        # Initialize rendering variables.
        self.render_buy_index = []
        self.render_sell_index = []

        return state

    def step(self, action):
        """
        Execute the action to get the next state.
        :param action: integer in [0, 2]
        :return: state:   numpy array of state shape.
                 reward: float
                 done: bool
                 info: dict or None
        """
        price = self.data[self.index_day]

        price_variation = price/self.last_price

        # Terminal state if the maximum number of iterations is reached.
        done = self.iter >= self.max_iter

        # Calculate reward.
        reward = 0.
        profit = 0.
        if action == self.action_space.actions['buy']:
            if self.stock_buying_price > 0:
                action = 4  # If we already have stock, we cannot buy any more. The action becomes "idle", but we assign value 4 for rendering purposes.
            else:
                self.stock_buying_price = self.last_price  # Store the buying price.

        elif action == self.action_space.actions['sell']:
            if self.stock_buying_price > 0:
                profit = self.last_price - self.stock_buying_price  # Calculate profit.
                reward = (self.last_price/self.stock_buying_price)-1.  # Calculate reward.

                self.stock_buying_price = 0.
            else:
                action = 3  # If we do not have stock, we cannot sell anything. The action becomes "idle"; we assign value 3 for rendering purposes.
        
        # If the agent has not sold the stock once the episode has finished, we calculate the reward with respect to the buying price.
        if done and self.stock_buying_price > 0.:
            reward = (self.last_price/self.stock_buying_price)-1.

        self.profit += profit

        gains = price/self.stock_buying_price if self.stock_buying_price > 0. else 0.


        # Create the next state. [price_variation, gains, stock bought].
        state = np.array([price_variation, gains, 1. if self.stock_buying_price > 0. else 0.])


        self.last_state = state
        self.last_price = price
        self.last_action = action
        self.last_reward = reward
        self.iter += 1
        self.index_day += 1

        return state, reward, done, None

    def close(self):
        # Close the rendering figure
        plt.close(1)

    def render(self):
        plt.clf()

        fig = plt.figure(1)
        ax = fig.add_subplot(2, 1, 1)


        data = self.data

        # Get the current stock price.
        valor = data[self.index_day-2]
        ax.plot(range(100), data[self.init_index-1: self.init_index+99])
        ax.set(xlabel='days', ylabel='price $',
               title='Stock Trading')

        if self.last_action == self.action_space.actions['buy']:
            marker = "^"
            color = 'g'
            self.render_buy_index.append([self.iter, valor])
        elif self.last_action == self.action_space.actions['sell']:
            marker = "v"
            color = 'r'
            self.render_sell_index.append([self.iter, valor])
        elif self.last_action == 3:  # Sell, but without stock the action is transformed to idle.
            marker = "v"
            color = 'y'
        elif self.last_action == 4:  # Buy, but we already have stock so the action is transformed to idle.
            marker = "^"
            color = 'y'
        else:  # 'idle'
            marker = "D"
            color = 'b'

        for s in self.render_sell_index:
            ax.plot(s[0], s[1], marker=7, color='r')

        for b in self.render_buy_index:
            ax.plot(b[0], b[1], marker=6, color='g')

        ax.plot(self.iter, valor, marker=marker, color=color)

        text1 = "profit: {:.1f}".format(self.profit)
        text2 = "   stock buying price: {:.1f}".format(self.stock_buying_price)
        text3 = "   current price: {:.1f}".format(valor)
        text4 = "   reward: {:.4f}".format(self.last_reward)
        ax.text(0.02, 0.95, text1 + text2 + text3 + text4, horizontalalignment='left', verticalalignment='center',
                transform=ax.transAxes)
        ax.grid()

        # We additionally show a sliding window of the last 20 days to see what the agent sees.
        ax2 = fig.add_subplot(2, 1, 2)
        ax2.plot(range(self.iter-19, self.iter+1), data[self.index_day-21: self.index_day - 1])
        ax2.set(xlabel='last 20 days', ylabel='price $',
               title='20 days window')
        ax2.plot(self.iter, valor, marker=marker, color=color)
        for s in self.render_sell_index:
            if self.iter - s[0] < 20:
                ax2.plot(s[0], s[1], marker="v", color='r')

        for b in self.render_buy_index:
            if self.iter - b[0] < 20:
                ax2.plot(b[0], b[1], marker="^", color='g')

        ax2.grid()

        plt.draw()
        plt.pause(0.01)


## Define the Environment

The next cell instantiates the environment class that we have created.

In [None]:
env = StockTrading()
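
Before training, it can be useful to run one episode with random actions and confirm that the reset/step loop terminates. The sketch below shows that loop on a tiny stand-in environment so that it runs on its own; `ToyEnv` and `random_rollout` are hypothetical names that are not part of the library.

```python
import random


class ToyEnv:
    """Hypothetical stand-in exposing the same reset/step interface."""

    def __init__(self, max_iter=5):
        self.max_iter = max_iter
        self.iter = 0

    def reset(self):
        self.iter = 0
        return [1.0, 0.0, 0.0]

    def step(self, action):
        self.iter += 1
        done = self.iter >= self.max_iter
        return [1.0, 0.0, 0.0], 0.0, done, None


def random_rollout(env, n_actions):
    """Play one episode with uniformly random actions."""
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = random.randrange(n_actions)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total, steps = random_rollout(ToyEnv(max_iter=5), n_actions=3)
print(total, steps)
```

With the environment from this notebook, the equivalent call would be `random_rollout(env, env.action_space.n)`.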

## Defining the Neural Network Architecture

We define the network architecture using the function "ppo_net" from "RL_Agent.base.utils.networks.networks.py", which returns a dictionary. As we are using an Actor-Critic agent, this function requires the user to define the parameters of both neural networks: the actor net and the critic net.

In [None]:
# We define the neural network in the most advanced way the library allows, with a Keras Sequential model
def actor_lstm_custom_model(input_shape):
    actor_model = Sequential()
    actor_model.add(LSTM(64, input_shape=input_shape, activation='tanh'))
    actor_model.add(Dense(128, input_shape=input_shape, activation='relu'))
    actor_model.add(Dense(128, activation='relu'))

    return actor_model

def critic_lstm_custom_model(input_shape):
    critic_model = Sequential()
    critic_model.add(LSTM(64, input_shape=input_shape, activation='tanh'))
    critic_model.add(Dense(64, input_shape=input_shape, activation='relu'))
    critic_model.add(Dense(64, activation='relu'))

    return critic_model

In [None]:
net_architecture = networks.ppo_net(use_custom_network=True,
                                    actor_custom_network=actor_lstm_custom_model,
                                    critic_custom_network=critic_lstm_custom_model)

## Defining the RL Agent


Here, we define the RL agent using the next parameters:

* actor_lr: learning rate for training the actor neural network.
* critic_lr: learning rate for training the critic neural network.
* batch_size: Size of the batches used for training the neural network. 
* memory_size: Size of the buffer filled with experiences in each algorithm iteration. 
* epsilon: Determines the amount of exploration (float in [0, 1]). 0 -> Full exploitation; 1 -> Full exploration.
* epsilon_decay: Decay factor of epsilon. In each iteration we calculate the new epsilon value as: epsilon' = epsilon * epsilon_decay.
* epsilon_min: minimum value epsilon can reach during the training procedure.
* net_architecture: net architecture defined before.
* n_stack: number of stacked timesteps to form the state.
* loss_critic_discount: Discount factor for the loss coming from the critic in the actor net calculation.
* loss_entropy_beta: Discount factor for the entropy term of the loss function.
* tensorboard_dir: path to the folder where tensorboard summaries are stored.
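
The epsilon parameters above interact as a simple multiplicative schedule with a floor. The sketch below assumes the floor is applied by clamping with `max`; how the library enforces "epsilon_min" internally is not shown in this notebook, and `epsilon_schedule` is a hypothetical helper.

```python
def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, n_iterations):
    """Return the epsilon value used at each iteration."""
    values = []
    for _ in range(n_iterations):
        values.append(epsilon)
        # epsilon' = epsilon * epsilon_decay, never going below epsilon_min (assumed clamping).
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return values


# Same values as the agent defined in this notebook.
schedule = epsilon_schedule(1.0, 0.97, 0.15, 100)
print(schedule[0], schedule[-1])
```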

In [None]:
agent = ppo_agent_discrete_parallel.Agent(actor_lr=1e-3,
                                         critic_lr=1e-3,
                                         batch_size=128,
                                         memory_size=100,
                                         epsilon=1.0,
                                         epsilon_decay=0.97,
                                         epsilon_min=0.15,
                                         net_architecture=net_architecture,
                                         n_stack=5,
                                         loss_critic_discount=0.001,
                                         loss_entropy_beta=0.01,
                                         tensorboard_dir='tensorboard_logs')
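
To make "n_stack" concrete, the sketch below stacks the last 5 states of size 3 into an array of shape (5, 3), which is the kind of sequence an LSTM layer consumes. This is only an illustration: `StateStacker` is a hypothetical class, and pre-filling the buffer with copies of the first state is an assumption, not necessarily what the library does.

```python
from collections import deque

import numpy as np


class StateStacker:
    """Keep the last n_stack states and return them as one 2-D array."""

    def __init__(self, n_stack, first_state):
        # Assumption: the buffer starts filled with copies of the first state.
        self.buffer = deque([first_state] * n_stack, maxlen=n_stack)

    def push(self, state):
        self.buffer.append(state)  # The oldest state is dropped automatically.
        return np.stack(list(self.buffer))


stacker = StateStacker(n_stack=5, first_state=np.array([1.0, 1.0, 0.0]))
stacked = stacker.push(np.array([1.02, 0.0, 0.0]))
print(stacked.shape)
```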

In [None]:
# from RL_Agent import dpg_agent

# net_architecture = networks.dpg_net(use_custom_network=True,
#                                     custom_network=actor_lstm_custom_model)
# agent = dpg_agent.Agent(learning_rate=1e-4,
#                         batch_size=64,
#                         net_architecture=net_architecture,
#                         n_stack=20,
#                        tensorboard_dir='tensorboard_logs')

## Build a RL Problem

Create an RL problem, where the communication between agent and environment is managed. In this case, we use the functionality from "RL_Problem.rl_problem.py", which makes the selection of the matching problem transparent to the user. The function "Problem" automatically selects the problem based on the agent used.

In [None]:
problem = rl_problem.Problem(env, agent)

## Solving the RL Problem

The next step is to solve the RL problem that we have defined.

In [None]:
problem.solve(1000, render=False)

Let's see the performance of the trained agent. To see the execution of this environment correctly, we need to run matplotlib in window mode. We do this by running the "%matplotlib qt" instruction.

In [None]:
%matplotlib qt

problem.test(render=False, n_iter=20)

In [None]:
hist = problem.get_histogram_metrics()
history_utils.plot_reward_hist(hist, 10)

## Run Tensorboard to See the Recorded Summaries

Let's look at the tensorboard logs. The next cell executes the command that starts the tensorboard service. To see the result, open a browser tab at the URL that the command prints, usually http://localhost:6006/

In [None]:
!tensorboard --logdir=tensorboard_logs

# Takeaways

- We trained a multithreaded PPO agent.
- We learned how to create an environment for approaching custom problems.
- We learned how to use the python interface for environments and its required properties and functions.
- We used real world data to create a trading bot.