<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/ThirdStockEnivornment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Third Stock Trading Environment


  This third stock trading environment is based on Adam King's article as found here:[Creating Bitcoin trading bots don’t lose money](https://medium.com/towards-data-science/creating-bitcoin-trading-bots-that-dont-lose-money-2e7165fb0b29). Similar to the first stock trading environment based on Maxim Lapan's implementation as found in chapter eight of his book [Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998), the agent is trading in the environment of the [SPY ETF](https://www.etf.com/SPY?L=1) except in this trading environment the agent is tasked with two discrete actions of not only buying, selling or holding shares but also tasked with determining the amount to buy/sell ranging from 1 to 100 (which will be converted into pecentage form i.e. 1/100=1%, 100/100=100%) based on its trading account/balance [trading account](https://www.investopedia.com/terms/t/tradingaccount.asp#:~:text=A%20trading%20account%20is%20an,margin%20requirements%20set%20by%20FINRA.).  


In [1]:
# ignore warning messages because they are annoying lol
import warnings
warnings.filterwarnings('ignore')

# Installing Necessary Package for Training the Trading Agent

To train the Trading Agent the package [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/index.html) was used. As stated in the docs: 
> Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines. And steems from the paper [Stable-Baselines3: Reliable Reinforcement Learning Implementations](https://jmlr.org/papers/volume22/20-1364/20-1364.pdf)
The algorithms in this package will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

---
## Proximal Policy Optimization(PPO):

Because in this environment the Agent will be executing continous actions, the Proximal Policy Optimization(PPO) algorithm was chosen. As detailed by the authors [PPO](https://arxiv.org/pdf/1707.06347.pdf)


> We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).


PPO uses the following novel objective function:

$L^{CLIP}(θ)=\hat{E}_t[min(r_{t}(θ)\hat{A}_t,clip(r_{t}(θ), 1-ϵ, 1+ϵ)\hat{A}_t]$

*  $\theta$ is the policy parameter
*  $\hat{E}_t$ denotes the empirical expectation over timesteps
*  $r_{t}$ is the ratio of the probability under the new and old policies, respectively
*  $\hat{A}_t$ is the estimated advantage at time t
*  $\epsilon$ is the clipping hyperparameter, usually 0.1 or 0.2


As detailed by the authors [openAI](https://openai.com/blog/openai-baselines-ppo/#ppo)


> This objective implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent, and simplifies the algorithm by removing the KL penalty and need to make adaptive updates. In tests, this algorithm has displayed the best performance on continuous control tasks and almost matches ACER’s performance on Atari, despite being far simpler to implement


  









In [2]:
!pip install stable-baselines3[extra]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Installing the Necessary Packages for Visualizing the Trading Agent's Envirnoment on Google Colab Notebooks

In [3]:
!pip install mpl_finance #used for plotting the candelstick graph
!pip install moviepy #
!pip install imageio_ffmpeg #
!pip install pyvirtualdisplay > /dev/null 2>&1 #used to create a display for vm
!apt-get install x11-utils > /dev/null 2>&1 #
!pip install pyglet==v1.3.2 > /dev/null 2>&1 #
!apt-get install -y xvfb python-opengl > /dev/null 2>&1 #

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import random
import json
import gym
from gym import spaces
from gym.utils import seeding
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import style
import json
import datetime as dt
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.env_util import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback
import torch as th
import torch.nn as nn
import collections
import datetime
from sklearn import preprocessing
from mpl_finance import candlestick_ochl as candlestick
import math
import os
import moviepy.video.io.ImageSequenceClip
import glob
import re
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

In [5]:
# stock environment parameters
MAX_ACCOUNT_BALANCE = 2147483647
MAX_NUM_SHARES = 2147483647
MAX_SHARE_PRICE = 4294967295
LOOKBACK_WINDOW_SIZE = 10
MAX_STEPS = 20000
# max trading of agent in real environment, 60 days
MAX_TRADING_SESSION=60
INITIAL_ACCOUNT_BALANCE = 10000
# default percentage of stock price trading agent pays broker when 
# buying/selling, default is 0.1% (i.e. very reasonable)
DA_COMMISION = 0.1


# Visualization Parameters
style.use('dark_background')
VOLUME_CHART_HEIGHT = 0.33
UP_COLOR = '#27A59A'
DOWN_COLOR = '#EF534F'
UP_TEXT_COLOR = '#73D3CC'
DOWN_TEXT_COLOR = '#DC2C27'


# Video Parameters
fps=1
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7fc6ccf33750>

 ## Creating visualization for Stock/ETF Environment
 As detalied by Adam King in his article titled
 [Rendering elegant stock trading agents using Matplotlib 
 and Gym](https://towardsdatascience.com/visualizing-stock-trading-agents-using-matplotlib-and-gym-584c992bc6d4)

In [6]:
def date2num(date):
  converter = mdates.strpdate2num('%Y-%m-%d')
  return converter(date)

class StockTradingGraph:
  """A stock trading visualization using matplotlib made to render 
    OpenAI gym environments"""
  def __init__(self, df, title=None):
    self.df = df
    self.net_worths = np.zeros(len(df.date))
    self.count = 0
    self._step = 0
    

    # Create a figure on screen and set the title
    fig = plt.figure()
    fig.suptitle(title)
    # Create top subplot for net worth axis
    self.net_worth_ax = plt.subplot2grid(shape=(6, 1), loc=(0, 0), rowspan=2,     
      colspan=1)
  
    # Create bottom subplot for shared price/volume axis
    self.price_ax = plt.subplot2grid(shape=(6, 1), loc=(2, 0), rowspan=8, 
      colspan=1, sharex=self.net_worth_ax)
    # Create a new axis for volume which shares its x-axis with price
    self.volume_ax = self.price_ax.twinx()
    # Add padding to make graph easier to view
    plt.subplots_adjust(left=0.11, bottom=0.24, right=0.90, 
      top=0.90, wspace=0.2, hspace=0)
    
  
  def render(self, current_step, net_worth, trades, window_size=40):
    self.net_worths[current_step] = net_worth
    window_start = max(current_step - window_size, 0)
    step_range = range(window_start, current_step + 1)
    # Format dates as timestamps, necessary for candlestick graph
    dates = np.array([date2num(x) for x in self.df.date[step_range]])

    self._render_net_worth(current_step, net_worth, step_range, dates)
    self._render_price(current_step, net_worth, dates, step_range)
    self._render_volume(current_step, net_worth, dates, step_range)
    self._render_trades(current_step, trades, step_range)
        
    # Format the date ticks to be more easily read
    self.price_ax.set_xticklabels(self.df.date[step_range], rotation=45, ha='right')
        
    # Hide duplicate net worth date labels
    plt.setp(self.net_worth_ax.get_xticklabels(), visible=False)
    # Necessary to view frames before they are unrendered  
    if self.count < 1:
      plt.savefig(f'{current_step}.png')
      self._step = current_step
      self.count += 1
    else: 
      if self._step + LOOKBACK_WINDOW_SIZE==current_step: 
        plt.savefig(f'{current_step}.png')
        self._step = current_step

  def _render_net_worth(self, current_step, net_worth, step_range, dates):
    # Clear the frame rendered last step
    self.net_worth_ax.clear()

    # Plot net worths
    self.net_worth_ax.plot_date(dates, self.net_worths[step_range],
                                '-', label='Net Worth')

    # Show legend, which uses the label we defined for the plot above
    self.net_worth_ax.legend()
    legend = self.net_worth_ax.legend(loc=2, ncol=2, prop={'size': 8})
    legend.get_frame().set_alpha(0.4)

    last_date = date2num(self.df.date[current_step])
    last_net_worth = self.net_worths[current_step]

    # Annotate the current net worth on the net worth graph
    self.net_worth_ax.annotate('{0:.2f}'.format(net_worth), (last_date, last_net_worth),
                                   xytext=(last_date, last_net_worth),
                                   bbox=dict(boxstyle='round',
                                             fc='w', ec='k', lw=1),
                                   color="black",
                                   fontsize="small")

    # Add space above and below min/max net worth
    self.net_worth_ax.set_ylim(
            min(self.net_worths[np.nonzero(self.net_worths)]) / 1.25, max(self.net_worths) * 1.25)
    

  def _render_price(self, current_step, net_worth, dates, step_range):
        self.price_ax.clear()

        # Format data for OHCL candlestick graph
        candlesticks = zip(dates,
                           self.df.open[step_range], self._cur_close(step_range),
                           self._cur_high(step_range), self._cur_low(step_range))

        # Plot price using candlestick graph from mpl_finance
        candlestick(self.price_ax, candlesticks, width=0.5,
                    colorup=UP_COLOR, colordown=DOWN_COLOR)

        last_date = date2num(self.df.date[current_step])
        last_close = self._cur_close(current_step)
        last_high = self._cur_high(current_step)

        # Print the current price to the price axis
        self.price_ax.annotate('{0:.2f}'.format(last_close), (last_date, last_close),
                               xytext=(last_date, last_high),
                               bbox=dict(boxstyle='round',
                                         fc='w', ec='k', lw=1),
                               color="black",
                               fontsize="small")

        # Shift price axis up to give volume chart space
        ylim = self.price_ax.get_ylim()
        self.price_ax.set_ylim(ylim[0] - (ylim[1] - ylim[0])
                               * VOLUME_CHART_HEIGHT, ylim[1])
        
  def _render_volume(self, current_step, net_worth, dates, step_range):
        self.volume_ax.clear()

        volume = np.array(self.df.volume[step_range])

        pos = self.df.open[step_range] - \
            self._cur_close([step_range]) < 0
        neg = self.df.open[step_range] - \
            self._cur_close([step_range]) > 0


        # Color volume bars based on price direction on that date
        self.volume_ax.bar(dates[pos], volume[pos], color=UP_COLOR,
                           alpha=0.4, width=0.5, align='center')
        self.volume_ax.bar(dates[neg], volume[neg], color=DOWN_COLOR,
                           alpha=0.4, width=0.5, align='center')

        # Cap volume axis height below price chart and hide ticks
        self.volume_ax.set_ylim(0, max(volume) / VOLUME_CHART_HEIGHT)
        self.volume_ax.yaxis.set_ticks([])
       

  def _render_trades(self, current_step, trades, step_range):
    for trade in trades:
      if trade['step'] in step_range:
        date = date2num(self.df.date[trade['step']])
        high = self._cur_high(trade['step'])
        low = self._cur_low(trade['step'])
         
        if trade['type'] == 'buy':
          high_low = low
          color = UP_TEXT_COLOR
        else:
          high_low = high
          color = DOWN_TEXT_COLOR
        
        total = '{0:.2f}'.format(trade['total'])
        
        # Print the current price to the price axis
        self.price_ax.annotate(f'${total}', (date, high_low),
                             xytext=(date, high_low),
                             color=color,
                             fontsize=8,
                             arrowprops=(dict(color=color)))
                
  def _cur_close(self, offset):
      """
      Calculate real close price for the current bar
      """
      return self.df.real_close[offset]

  def _cur_high(self, offset):
      """
      Calculate real close price for the current bar
      """
      return self.df.real_high[offset]

  def _cur_low(self, offset):
      """
      Calculate real close price for the current bar
      """
      return self.df.real_low[offset]

  def close(self):
    plt.close()
              
  

In [15]:
# Stock/ETF Trading Enviornment
class StockTradingEnv(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, data, random_ofs_on_reset=True):
        super(StockTradingEnv, self).__init__()
        self.data = data
        self.random_ofs_on_reset = random_ofs_on_reset
        self.visualization = None 
        self.track_reward = 0
        self.current_step = 0
        self.commission = DA_COMMISION

        # Actions of the format Buy x%, Sell x%, Hold, etc.
        self.action_space = spaces.MultiDiscrete(nvec=[3, 100], dtype=np.int16)

        # Prices contains the OHCL values for the last five prices
        self.observation_space = spaces.Box(
            low=0, high=1, shape=self.shape, dtype=np.float32)
        
        self.seed()

    def _reset_session(self):
      self.current_step = 0
      if not self.random_ofs_on_reset:
        self.steps_left = self.data.date.shape[0] - (LOOKBACK_WINDOW_SIZE - 1)
        self.frame_start = LOOKBACK_WINDOW_SIZE
      else:
        self.steps_left = np.random.randint(1, MAX_TRADING_SESSION)
        self.frame_start = np.random.randint(LOOKBACK_WINDOW_SIZE, self.data.date.shape[0] - self.steps_left)
      
      self.active_frame = np.array([
          self.data.open[self.frame_start - LOOKBACK_WINDOW_SIZE:self.frame_start + self.steps_left],
          self.data.high[self.frame_start - LOOKBACK_WINDOW_SIZE:self.frame_start + self.steps_left],
          self.data.low[self.frame_start - LOOKBACK_WINDOW_SIZE:self.frame_start + self.steps_left],
          self.data.close[self.frame_start - LOOKBACK_WINDOW_SIZE:self.frame_start + self.steps_left],
          self.data.volume[self.frame_start - LOOKBACK_WINDOW_SIZE:self.frame_start + self.steps_left]
          ])


    def reset(self):
      self.trades = []
      self.balance = INITIAL_ACCOUNT_BALANCE
      self.net_worth = INITIAL_ACCOUNT_BALANCE
      self.max_net_worth = INITIAL_ACCOUNT_BALANCE
      self.standkeMaxBenchShares = 0
      self.shares_held  = 0
      self._reset_session()
      self.account_history = np.repeat([
          [self.net_worth/MAX_ACCOUNT_BALANCE],
          [0],
          [0],
          [0],
          [0]
          ], LOOKBACK_WINDOW_SIZE + 1, axis=1)

      return self._next_observation()  

    # shape of observation space is 2D
    @property
    def shape(self):
      return (10, LOOKBACK_WINDOW_SIZE + 1)

    def _next_observation(self):
      end = self.current_step + (LOOKBACK_WINDOW_SIZE+1)
      obs = np.array([
          self.active_frame[0][self.current_step:end],
          self.active_frame[1][self.current_step:end],
          self.active_frame[2][self.current_step:end],
          self.active_frame[3][self.current_step:end],
          self.active_frame[4][self.current_step:end] 
      ])
      obs = np.append(obs, self.account_history[:, -(LOOKBACK_WINDOW_SIZE+ 1):], axis=0) 
      print(obs)     
      return obs
       
    def _take_action(self, action):
      current_price = self._cur_close()
      action_type = action[0]
      amount = action[1]/100

      shares_bought = 0
      shares_sold = 0
      additional_cost = 0
      sales = 0


      if action_type < 1:
        # Buy amount % of balance in shares
        total_possible = self.balance / current_price
        shares_bought = total_possible * amount
        additional_cost = shares_bought * current_price * (1+self.commission)
        self.balance -= additional_cost
        self.standkeMaxBenchShares += shares_bought
        self.shares_held += shares_bought
        
        
        # visualization portion
        if shares_bought > 0:
          self.trades.append({'step': self.frame_start+self.current_step, 'shares': shares_bought, 
                              'total': additional_cost, 'type': "buy"})
          
      elif action_type < 2:
        # Sell amount % of shares held
        shares_sold = self.shares_held * amount  
        sales = shares_sold * current_price * (1 - self.commission)
        self.balance += sales
        self.standkeMaxBenchShares -= shares_sold
        

        # visualization portion
        if shares_sold > 0:
          self.trades.append({'step': self.current_step, 'shares': shares_sold, 
                                  'total': shares_sold * current_price, 'type': "sell"})  
          
      
      self.netWorth = self.balance + self.shares_held * current_price
      
      if self.netWorth > self.max_net_worth:
        self.max_net_worth = self.netWorth

      # updating account history
      self.account_history = np.append(self.account_history, [
        [self.net_worth/MAX_ACCOUNT_BALANCE],
        [shares_bought/MAX_NUM_SHARES],
        [additional_cost/MAX_SHARE_PRICE],
        [shares_sold/MAX_NUM_SHARES],
        [sales/MAX_SHARE_PRICE]
        ], axis=1)

    def _cur_close(self):
      """
      Calculate real close price for the current bar
      """
      return self.data.real_close[self.current_step]

    def step(self, action):
      reward = 0
      current_price = self._cur_close()
      # Execute one time step within the environment
      self._take_action(action)
        
      self.steps_left -= 1
      self.current_step += 1

      if self.steps_left == 0:
        self.balance += self.shares_held * current_price
        self.standkeMaxBenchShares = 0
        self._reset_session()

      if self.current_step >= self.data.close.shape[0]-1 or self.netWorth <= 0:
        done=True
      else:
        done=False
      
      prev_net_worth = self.account_history[0, 10] 
      reward = self.netWorth - prev_net_worth

      obs = self._next_observation()

      return obs, reward, done, {}

    def _render_to_file(self, filename='render.txt'):
      f = open(filename, 'a+')
      f.write(f"Step: {self.current_step}\n")
      f.write(f"Date: {self.data.date[self.current_step]}\n")
      f.write(f"Balence: {self.balance}\n")
      f.write(f"Reward: {self.track_reward}\n")
      f.write(f"Amount Held: {self.shares_held}\n") 
      f.write(f"Amount Sold: {self.total_shares_sold}\n)")
      #add some more
      f.close()

    def render(self, mode='file', title="Agent's Trading Screen", **kwargs):
      # Render the environment to the screen
      if mode == 'file':
        self._render_to_file(kwargs.get('filename', 'render.txt'))
      elif mode == 'live':
        if self.visualization == None:
          self.visualization = StockTradingGraph(self.data, title)
        if self.current_step > LOOKBACK_WINDOW_SIZE:
          self.visualization.render(self.current_step, self.netWorth,
                                    self.trades, window_size=LOOKBACK_WINDOW_SIZE)


    def seed(self, seed=None):
      self.np_random, seed1 = seeding.np_random(seed)
      seed2 = seeding.hash_seed(seed1+1) % 2**33
      return [seed1, seed2]


In [8]:
# using sklearn's min-max scaler for the relative high and low
x=preprocessing.MinMaxScaler()

# taken from https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = np.log(dataset[i]) - np.log(dataset[i - interval])
		diff.append(value)
	return diff
 

# training data
df = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/archive/Data/ETFs/spy.us.txt')
df = df.sort_values('Date')
data=df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data['Open'], 1))
diff_h = np.array(difference(data['High'], 1))
diff_l = np.array(difference(data['Low'], 1))
diff_c = np.array(difference(data['Close'], 1))
# volumne data
vol = data['Volume'].values/MAX_NUM_SHARES
# year data of year-month-day form
dt = data['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)
rh = (diff_h-diff_o/diff_o)
rh = x.fit_transform(rh.reshape(-1,1)).reshape(-1)
rl = (diff_l-diff_o/diff_o)
rl = x.fit_transform(rl.reshape(-1,1)).reshape(-1)

Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_close', 'real_high', 'real_low'])
data=Data(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_close=data['Close'].values, real_high=data['High'].values, real_low=data['Low'].values)

In [9]:
# Testing data
test = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/test.csv')
t_df = test.sort_values('Date')
data_two=t_df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data_two['Open'], 1))
diff_h = np.array(difference(data_two['High'], 1))
diff_l = np.array(difference(data_two['Low'], 1))
diff_c = np.array(difference(data_two['Close'], 1))
# volumne data
vol = data_two['Volume'].values/MAX_NUM_SHARES
# year data of year-month-day form
dt = data_two['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)
rh = (diff_h-diff_o/diff_o)
rh = x.fit_transform(rh.reshape(-1,1)).reshape(-1)
rl = (diff_l-diff_o/diff_o)
rl = x.fit_transform(rl.reshape(-1,1)).reshape(-1)

Data_two = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_close', 'real_high', 'real_low'])
test_data=Data_two(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_close=data_two['Close'].values, real_high=data_two['High'].values, real_low=data_two['Low'].values)

# Creating Custom CNN Feature Extractor

In [10]:
class CustomCNN(BaseFeaturesExtractor):
  def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 256):
        super(CustomCNN, self).__init__(observation_space, features_dim)
        # We assume HxW images
        # Re-ordering will be done by pre-preprocessing or wrapper
        input = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(input, 128, kernel_size=2),
            nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=4),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())
        
  def forward(self, observations):
    return self.linear(self.cnn(observations))

# additional keyword parameters to pass to model net_arch=[dict(pi=[256], vf=[256])]
policy_kwargs = dict(
    features_extractor_class=CustomCNN,
    features_extractor_kwargs=dict(features_dim=128)
)

# Training and Validation Portion

In [16]:
# number of learning steps to train RL model is set to 200K
MAX_STEPS = 1e6
# number of epoch when optimizing the surrogate loss
EPOCHS = 10
# entropy coefficient for the loss calculation
E_COEF = 0.0
# limit the KL divergence between updates 
T_KL = None
# minibatch size
BATCH = 64
# the number of steps to run for each environment per update 
# (i.e. rollout buffer size is n_steps * n_envs where n_envs is 
# number of environment copies running in parallel) NOTE: n_steps * n_envs 
# must be greater than 1 (because of the advantage normalization)
N_STEPS=20
# value function coefficient for the loss calculation
VF_COEF= 0.5
# the number of parallel environments 
ENV = 5

# create evaluation env that takes in test data for validation
eval_env = DummyVecEnv([lambda: StockTradingEnv(test_data, random_ofs_on_reset=False)])
# use deterministic actions for evaluation callback
eval_callback = EvalCallback(eval_env, best_model_save_path='/content/drive/MyDrive/RLmodels/bestPPO/',
                             log_path='/content/drive/MyDrive/RLmodels/logs/', eval_freq=MAX_STEPS/100,
                             deterministic=False, render=False)

# create training envs that takes in training data for training
envs =  DummyVecEnv([lambda: StockTradingEnv(data, random_ofs_on_reset=True) for _ in range(0,ENV)])


# training model using CNNPolicy
model = PPO("CnnPolicy", envs, n_steps=N_STEPS, n_epochs=EPOCHS, verbose=1, ent_coef=E_COEF,vf_coef=VF_COEF, target_kl=T_KL, batch_size=BATCH, tensorboard_log="/content/PPO_SPY_tensorboard/", policy_kwargs=policy_kwargs)

Using cpu device


In [17]:
model.learn(total_timesteps=MAX_STEPS, callback=eval_callback)

[[4.18235406e-01 4.18771742e-01 4.19233506e-01 4.19167315e-01
  4.18127976e-01 4.18869009e-01 4.18785933e-01 4.21122846e-01
  4.19154856e-01 4.19173122e-01 4.19264529e-01]
 [5.14460213e-01 4.90015197e-01 5.09257133e-01 4.42470526e-01
  5.40666020e-01 5.44046117e-01 5.03003626e-01 4.92130655e-01
  5.20075597e-01 4.94730687e-01 5.16424446e-01]
 [5.46515184e-01 5.25922576e-01 5.02608916e-01 5.29291815e-01
  5.59108679e-01 5.75737910e-01 5.60141640e-01 5.30680951e-01
  4.93377528e-01 5.84616879e-01 5.30677918e-01]
 [3.99752570e-01 3.99923132e-01 3.99563594e-01 3.99648311e-01
  4.00030306e-01 3.99878040e-01 3.99739611e-01 3.97879700e-01
  3.99777041e-01 3.99782911e-01 3.99613905e-01]
 [3.05134687e-02 2.97708470e-02 3.77633204e-02 4.22739336e-02
  3.50988219e-02 3.42704361e-02 3.53121609e-02 2.07008184e-02
  2.12974576e-02 3.04277283e-02 3.42069999e-02]
 [4.65661288e-06 4.65661288e-06 4.65661288e-06 4.65661288e-06
  4.65661288e-06 4.65661288e-06 4.65661288e-06 4.65661288e-06
  4.65661288e-06

ValueError: ignored

# Prediction and Rendering Environment Portion

In [None]:
model = PPO.load("/content/drive/MyDrive/RLmodels/bestPPO/best_model.zip")
env = StockTradingEnv(test_data, random_ofs_on_reset=False)
obs = env.reset()
for i in range(len(test_data.date)):
  action, _states = model.predict(obs, deterministic=False)
  obs, rewards, done, info = env.step(action)
  env.render()
  if done:
    break

In [None]:
# taken from https://stackoverflow.com/questions/5967500/how-to-correctly-sort-a-string-with-a-number-inside

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]

list_of_files = [img for img in os.listdir('/content') if img.endswith(".png")]
list_of_files.sort(key=natural_keys)


In [None]:
# taken from https://stackoverflow.com/questions/44947505/how-to-make-a-movie-out-of-images-in-python
clip = moviepy.video.io.ImageSequenceClip.ImageSequenceClip(list_of_files, fps=fps)
clip.write_videofile('agent_trading.mp4')

In [None]:
# taken from https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t#scrollTo=8nj5sjsk15IT

def show_video():
  mp4list = glob.glob('agent_trading.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")

In [None]:
show_video()

In [None]:
!rm -r *.png

In [None]:
%tensorflow_version 2
%load_ext tensorboard
%tensorboard --logdir /content/PPO_SPY_tensorboard/ 