<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/ThirdStockEnivornment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Third Stock Trading Environment


  This third stock trading environment is based on Adam King's article as found here:[Creating Bitcoin trading bots don’t lose money](https://medium.com/towards-data-science/creating-bitcoin-trading-bots-that-dont-lose-money-2e7165fb0b29). Similar to the first stock trading environment based on Maxim Lapan's implementation as found in chapter eight of his book [Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998), the agent is trading in the environment of the [SPY ETF](https://www.etf.com/SPY?L=1) except in this trading environment the agent is tasked with two discrete actions of not only buying, selling or holding shares but also tasked with determining the amount to buy/sell ranging from 1 to 100 (which will be converted into pecentage form i.e. 1/100=1%, 100/100=100%) based on its trading account/balance [trading account](https://www.investopedia.com/terms/t/tradingaccount.asp#:~:text=A%20trading%20account%20is%20an,margin%20requirements%20set%20by%20FINRA.).  


In [1]:
# ignore warning messages because they are annoying lol
import warnings
warnings.filterwarnings('ignore')

# Installing Necessary Package for Training the Trading Agent

To train the Trading Agent the package [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/index.html) was used. As stated in the docs: 
> Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines. And steems from the paper [Stable-Baselines3: Reliable Reinforcement Learning Implementations](https://jmlr.org/papers/volume22/20-1364/20-1364.pdf)
The algorithms in this package will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

---
## Proximal Policy Optimization(PPO):

Because in this environment the Agent will be executing continous actions, the Proximal Policy Optimization(PPO) algorithm was chosen. As detailed by the authors [PPO](https://arxiv.org/pdf/1707.06347.pdf)


> We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).


PPO uses the following novel objective function:

$L^{CLIP}(θ)=\hat{E}_t[min(r_{t}(θ)\hat{A}_t,clip(r_{t}(θ), 1-ϵ, 1+ϵ)\hat{A}_t]$

*  $\theta$ is the policy parameter
*  $\hat{E}_t$ denotes the empirical expectation over timesteps
*  $r_{t}$ is the ratio of the probability under the new and old policies, respectively
*  $\hat{A}_t$ is the estimated advantage at time t
*  $\epsilon$ is the clipping hyperparameter, usually 0.1 or 0.2


As detailed by the authors [openAI](https://openai.com/blog/openai-baselines-ppo/#ppo)


> This objective implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent, and simplifies the algorithm by removing the KL penalty and need to make adaptive updates. In tests, this algorithm has displayed the best performance on continuous control tasks and almost matches ACER’s performance on Atari, despite being far simpler to implement


  









In [2]:
!pip install stable-baselines3[extra]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stable-baselines3[extra]
  Downloading stable_baselines3-1.6.0-py3-none-any.whl (177 kB)
[K     |████████████████████████████████| 177 kB 15.2 MB/s 
Collecting gym==0.21
  Downloading gym-0.21.0.tar.gz (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 61.9 MB/s 
Collecting protobuf~=3.19.0
  Downloading protobuf-3.19.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 74.9 MB/s 
Collecting ale-py==0.7.4
  Downloading ale_py-0.7.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 58.0 MB/s 
Collecting autorom[accept-rom-license]~=0.4.2
  Downloading AutoROM-0.4.2-py3-none-any.whl (16 kB)
Collecting AutoROM.accept-rom-license
  Downloading AutoROM.accept-rom-license-0.4.2.tar.gz (9.8 kB)
  Installing build dependencies ... [?25l[?25hdon

# Installing the Necessary Packages for Visualizing the Trading Agent's Envirnoment on Google Colab Notebooks

In [3]:
!pip install mpl_finance #used for plotting the candelstick graph
!pip install moviepy #
!pip install imageio_ffmpeg #
!pip install pyvirtualdisplay > /dev/null 2>&1 #used to create a display for vm
!apt-get install x11-utils > /dev/null 2>&1 #
!pip install pyglet==v1.3.2 > /dev/null 2>&1 #
!apt-get install -y xvfb python-opengl > /dev/null 2>&1 #

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mpl_finance
  Downloading mpl_finance-0.10.1-py3-none-any.whl (8.4 kB)
Installing collected packages: mpl-finance
Successfully installed mpl-finance-0.10.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting imageio_ffmpeg
  Downloading imageio_ffmpeg-0.4.7-py3-none-manylinux2010_x86_64.whl (26.9 MB)
[K     |████████████████████████████████| 26.9 MB 1.4 MB/s 
[?25hInstalling collected packages: imageio-ffmpeg
Successfully installed imageio-ffmpeg-0.4.7


In [4]:
import random
import json
import gym
from gym import spaces
from gym.utils import seeding
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import style
import json
import datetime as dt
from typing import Callable, Dict, List, Optional, Tuple, Type, Union
from stable_baselines3 import PPO
from stable_baselines3.common.utils import constant_fn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.env_util import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_checker import VecCheckNan, check_env
import torch as th
import torch.nn as nn
import collections
import datetime
from sklearn import preprocessing
from mpl_finance import candlestick_ochl as candlestick
import math
import os
import moviepy.video.io.ImageSequenceClip
import glob
import re
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

In [5]:
# stock environment parameters
MAX_ACCOUNT_BALANCE = 2147483647
MAX_NUM_SHARES = 2147483647
MAX_SHARE_PRICE = 4294967295
LOOKBACK_WINDOW_SIZE = 5
MAX_STEPS = 20000
INITIAL_ACCOUNT_BALANCE = 10000
# default percentage of stock price trading agent pays broker when 
# buying/selling, default is 0.1% (i.e. very reasonable)
DA_COMMISION = 0.1


# Visualization Parameters
style.use('dark_background')
VOLUME_CHART_HEIGHT = 0.33
UP_COLOR = '#27A59A'
DOWN_COLOR = '#EF534F'
UP_TEXT_COLOR = '#73D3CC'
DOWN_TEXT_COLOR = '#DC2C27'


# Video Parameters
fps=1
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7fe4b1c7aad0>

 ## Creating visualization for Stock/ETF Environment
 As detalied by Adam King in his article titled
 [Rendering elegant stock trading agents using Matplotlib 
 and Gym](https://towardsdatascience.com/visualizing-stock-trading-agents-using-matplotlib-and-gym-584c992bc6d4)

In [6]:
def date2num(date):
  converter = mdates.strpdate2num('%Y-%m-%d')
  return converter(date)

class StockTradingGraph:
  """A stock trading visualization using matplotlib made to render 
    OpenAI gym environments"""
  def __init__(self, df, title=None):
    self.df = df
    self.net_worths = np.zeros(len(df.close))
    self.count = 0
    self._step = 0
    

    # Create a figure on screen and set the title
    fig = plt.figure()
    fig.suptitle(title)
    # Create top subplot for net worth axis
    self.net_worth_ax = plt.subplot2grid(shape=(6, 1), loc=(0, 0), rowspan=2,     
      colspan=1)
  
    # Create bottom subplot for shared price/volume axis
    self.price_ax = plt.subplot2grid(shape=(6, 1), loc=(2, 0), rowspan=8, 
      colspan=1, sharex=self.net_worth_ax)
    # Create a new axis for volume which shares its x-axis with price
    self.volume_ax = self.price_ax.twinx()
    # Add padding to make graph easier to view
    plt.subplots_adjust(left=0.11, bottom=0.24, right=0.90, 
      top=0.90, wspace=0.2, hspace=0)
    
  
  def render(self, current_step, net_worth, trades, window_size=40):
    self.net_worths[current_step] = net_worth
    window_start = max(current_step - window_size, 0)
    step_range = range(window_start, current_step + 1)
    # Format dates as timestamps, necessary for candlestick graph
    dates = np.array([date2num(x) for x in self.df.date[step_range]])

    self._render_net_worth(current_step, net_worth, step_range, dates)
    self._render_price(current_step, net_worth, dates, step_range)
    self._render_volume(current_step, net_worth, dates, step_range)
    self._render_trades(current_step, trades, step_range)
        
    # Format the date ticks to be more easily read
    self.price_ax.set_xticklabels(self.df.date[step_range], rotation=45, ha='right')
        
    # Hide duplicate net worth date labels
    plt.setp(self.net_worth_ax.get_xticklabels(), visible=False)
    # Necessary to view frames before they are unrendered  
    if self.count < 1:
      plt.savefig(f'{current_step}.png')
      self._step = current_step
      self.count += 1
    else: 
      if self._step + LOOKBACK_WINDOW_SIZE==current_step: 
        plt.savefig(f'{current_step}.png')
        self._step = current_step

  def _render_net_worth(self, current_step, net_worth, step_range, dates):
    # Clear the frame rendered last step
    self.net_worth_ax.clear()

    # Plot net worths
    self.net_worth_ax.plot_date(dates, self.net_worths[step_range],
                                '-', label='Net Worth')

    # Show legend, which uses the label we defined for the plot above
    self.net_worth_ax.legend()
    legend = self.net_worth_ax.legend(loc=2, ncol=2, prop={'size': 8})
    legend.get_frame().set_alpha(0.4)

    last_date = date2num(self.df.date[current_step])
    last_net_worth = self.net_worths[current_step]

    # Annotate the current net worth on the net worth graph
    self.net_worth_ax.annotate('{0:.2f}'.format(net_worth), (last_date, last_net_worth),
                                   xytext=(last_date, last_net_worth),
                                   bbox=dict(boxstyle='round',
                                             fc='w', ec='k', lw=1),
                                   color="black",
                                   fontsize="small")

    # Add space above and below min/max net worth
    self.net_worth_ax.set_ylim(
            min(self.net_worths[np.nonzero(self.net_worths)]) / 1.25, max(self.net_worths) * 1.25)
    

  def _render_price(self, current_step, net_worth, dates, step_range):
        self.price_ax.clear()

        # Format data for OHCL candlestick graph
        candlesticks = zip(dates,
                           self._cur_open(step_range), self._cur_close(step_range),
                           self._cur_high(step_range), self._cur_low(step_range))

        # Plot price using candlestick graph from mpl_finance
        candlestick(self.price_ax, candlesticks, width=0.5,
                    colorup=UP_COLOR, colordown=DOWN_COLOR)

        last_date = date2num(self.df.date[current_step])
        last_close = self._cur_close(current_step)
        last_high = self._cur_high(current_step)

        # Print the current price to the price axis
        self.price_ax.annotate('{0:.2f}'.format(last_close), (last_date, last_close),
                               xytext=(last_date, last_high),
                               bbox=dict(boxstyle='round',
                                         fc='w', ec='k', lw=1),
                               color="black",
                               fontsize="small")

        # Shift price axis up to give volume chart space
        ylim = self.price_ax.get_ylim()
        self.price_ax.set_ylim(ylim[0] - (ylim[1] - ylim[0])
                               * VOLUME_CHART_HEIGHT, ylim[1])
        
  def _render_volume(self, current_step, net_worth, dates, step_range):
        self.volume_ax.clear()

        volume = np.array(self._cur_vol(step_range))

        pos = self._cur_open(step_range) - \
            self._cur_close(step_range) < 0
        neg = self._cur_open(step_range)- \
            self._cur_close(step_range) > 0


        # Color volume bars based on price direction on that date
        self.volume_ax.bar(dates[pos], volume[pos], color=UP_COLOR,
                           alpha=0.4, width=0.5, align='center')
        self.volume_ax.bar(dates[neg], volume[neg], color=DOWN_COLOR,
                           alpha=0.4, width=0.5, align='center')

        # Cap volume axis height below price chart and hide ticks
        self.volume_ax.set_ylim(0, max(volume) / VOLUME_CHART_HEIGHT)
        self.volume_ax.yaxis.set_ticks([])
       

  def _render_trades(self, current_step, trades, step_range):
    for trade in trades:
      if trade['step'] in step_range:
        date = date2num(self.df.date[trade['step']])
        high = self._cur_high(trade['step'])
        low = self._cur_low(trade['step'])
         
        if trade['type'] == 'buy':
          high_low = low
          color = UP_TEXT_COLOR
        else:
          high_low = high
          color = DOWN_TEXT_COLOR
        
        total = '{0:.2f}'.format(trade['total'])
        
        # Print the current price to the price axis
        self.price_ax.annotate(f'${total}', (date, high_low),
                             xytext=(date, high_low),
                             color=color,
                             fontsize=8,
                             arrowprops=(dict(color=color)))
    
  def _cur_open(self, offset):
      """
      Calculate real open price for the current bar
      """
      return self.df.real_open[offset]              
  
  def _cur_close(self, offset):
      """
      Calculate real close price for the current bar
      """
      return self.df.real_close[offset]

  def _cur_high(self, offset):
      """
      Calculate real high price for the current bar
      """
      return self.df.real_high[offset]

  def _cur_low(self, offset):
      """
      Calculate real low price for the current bar
      """
      return self.df.real_low[offset]
  
  def _cur_vol(self, offset):
      """
      Calculate real volume for the current bar
      """
      return self.df.real_vol[offset]

  def close(self):
    plt.close()
              
  

In [24]:
# Stock/ETF Trading Enviornment
class StockTradingEnv(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, data, random_ofs_on_reset=True):
        super(StockTradingEnv, self).__init__()
        self.data = data
        self.scale = preprocessing.MinMaxScaler()
        self.random_ofs_on_reset = random_ofs_on_reset
        self.visualization = None 
        self.bars_count = LOOKBACK_WINDOW_SIZE
        self.commission = DA_COMMISION

        # Actions of the format Buy x%, Sell x%, Hold, etc.
        self.action_space = spaces.Box(low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float32)

        # Prices contains the OHCL values for the last five prices
        self.observation_space = spaces.Box(
            low=0, high=1, shape=self.shape, dtype=np.float32)
        
        self.seed()

    def reset(self):
      # random offset portion 
      bars = self.bars_count
      if self.random_ofs_on_reset:
        offset = self.np_random.choice(self.data.high.shape[0]-bars*10)+bars
      else:
        offset = bars
      self._reset(offset)
      return self._next_observation()

    def _reset(self, offset):
      self.trades = []
      self.balance = INITIAL_ACCOUNT_BALANCE
      self.netWorth = INITIAL_ACCOUNT_BALANCE
      self.max_net_worth = INITIAL_ACCOUNT_BALANCE
      self.standkeMaxBenchShares = 0
      self.shares_held  = 0
      self.total_shares_sold = 0
      self.track_reward = 0
      self._offset = offset
      # setting account history portion
      self.account_history = np.repeat([
            [self.netWorth/MAX_ACCOUNT_BALANCE],
            [0],
            [0],
            [0],
            [0]
            ], LOOKBACK_WINDOW_SIZE, axis=1) 
      self.net_track = [self.netWorth]

    # shape of observation space is 2D
    @property
    def shape(self):
      return (10, self.bars_count)

    def _next_observation(self):
      res = np.zeros(shape=(6, self.bars_count), dtype=np.float16)
      ofs = self.bars_count-1
      res[0] = self.data.open[self._offset-ofs:self._offset+1]
      res[1] = self.data.high[self._offset-ofs:self._offset+1]
      res[2] = self.data.low[self._offset-ofs:self._offset+1]
      res[3] = self.data.close[self._offset-ofs:self._offset+1]
      res[4] = self.data.volume[self._offset-ofs:self._offset+1]
      res[5] = self.account_history[0][-self.bars_count:]
      scaled = self.scale.fit_transform(self.account_history[1:, :])
      res = np.append(res, scaled,axis=0)
      res = np.float16(res)
      return res
       
    def _take_action(self, action):
      current_price = self._cur_close()
      action_type = action[0]
      amount = action[1]

      shares_bought = 0
      shares_sold = 0
      additional_cost = 0
      sales = 0


      if action_type < 1 :
        # Buy amount % of balance in shares
        total_possible = self.balance / current_price
        shares_bought = total_possible * amount
        additional_cost = shares_bought * current_price * (1+self.commission)
        self.balance -= additional_cost
        self.standkeMaxBenchShares += shares_bought
        self.shares_held += shares_bought
        
        
        # visualization portion
        if shares_bought > 0:
          self.trades.append({'step': self._offset, 'shares': shares_bought, 
                              'total': shares_bought * current_price, 'type': "buy"})
          
      elif action_type < 2:
        # Sell amount % of shares held
        shares_sold = self.shares_held * amount  
        sales = shares_sold * current_price * (1 - self.commission)
        self.balance += sales
        self.standkeMaxBenchShares -= shares_sold
        self.total_shares_sold += shares_sold
        

        # visualization portion
        if shares_sold > 0:
          self.trades.append({'step': self._offset, 'shares': shares_sold, 
                                  'total': shares_sold * current_price, 'type': "sell"})  
          
      
      self.netWorth = self.balance + self.shares_held * current_price
      
      if self.netWorth > self.max_net_worth:
        self.max_net_worth = self.netWorth

      # updating account history
      self.account_history = np.append(self.account_history, [
        [self.netWorth/MAX_ACCOUNT_BALANCE],
        [shares_bought],
        [additional_cost],
        [shares_sold],
        [sales]
        ], axis=1)[:, -self.bars_count:]

      self.net_track.append(self.netWorth)
      
      
    def _cur_close(self):
      """
      Calculate real close price for the current bar
      """
      return self.data.real_close[self._offset]

    def step(self, action):
      reward = 0
      current_price = self._cur_close()
      # Execute one time step within the environment
      self._take_action(action)
    
      self._offset += 1

      if self._offset >= self.data.close.shape[0]-1 or self.netWorth <= 0 or self.netWorth>=MAX_ACCOUNT_BALANCE:
        done=True
      else:
        done=False
  
      obs = self._next_observation()

      prev_netWorth = self.net_track[-2]
      current_netWorth = self.net_track[-1]
      reward = int(np.log(current_netWorth))/int(np.log(prev_netWorth))
      self.track_reward += reward
      
      return obs, reward, done, {}

    def _render_to_file(self, filename='render.txt'):
      f = open(filename, 'a+')
      f.write(f"Step: {self._offset}\n")
      f.write(f"Date: {self.data.date[self._offset]}\n")
      f.write(f"Net Worth: {self.netWorth}\n")
      f.write(f"Reward: {self.track_reward}\n")
      f.write(f"Amount Held: {self.shares_held}\n")
      f.write(f"Amount Sold: {self.total_shares_sold}\n")
      #add some more
      f.close()

    def render(self, mode='file', title="Agent's Trading Screen", **kwargs):
      # Render the environment to the screen
      if mode == 'file':
        self._render_to_file(kwargs.get('filename', 'render.txt'))
      elif mode == 'live':
        if self.visualization == None:
          self.visualization = StockTradingGraph(self.data, title)
        if self._offset > LOOKBACK_WINDOW_SIZE:
          self.visualization.render(self._offset, self.netWorth,
                                    self.trades, window_size=LOOKBACK_WINDOW_SIZE)


    def seed(self, seed=None):
      self.np_random, seed1 = seeding.np_random(seed)
      seed2 = seeding.hash_seed(seed1+1) % 2**33
      return [seed1, seed2]


In [8]:
# using sklearn's min-max scaler for the relative high and low
x=preprocessing.MinMaxScaler()

# taken from https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/
# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = np.log(dataset[i]) - np.log(dataset[i - interval])
		diff.append(value)
	return diff
 
# training data
df = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/archive/Data/ETFs/spy.us.txt')
df = df.sort_values('Date')
data=df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data['Open'], 1))
diff_h = np.array(difference(data['High'], 1))
diff_l = np.array(difference(data['Low'], 1))
diff_c = np.array(difference(data['Close'], 1))
# volumne data
vol = data['Volume'].values/MAX_NUM_SHARES
# year data of year-month-day form
dt = data['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)

rh = x.fit_transform(diff_h.reshape(-1,1)).reshape(-1)
rl = x.fit_transform(diff_l.reshape(-1,1)).reshape(-1)

Train_Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_open',  'real_close', 'real_high', 'real_low', 'real_vol'])
train = Train_Data(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_open=data['Open'].values, real_close=data['Close'].values, real_high=data['High'].values, real_low=data['Low'].values, real_vol=data['Volume'].values)

In [9]:
# Testing data
test = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/test.csv')
t_df = test.sort_values('Date')
data_two=t_df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data_two['Open'], 1))
diff_h = np.array(difference(data_two['High'], 1))
diff_l = np.array(difference(data_two['Low'], 1))
diff_c = np.array(difference(data_two['Close'], 1))
# volumne data
vol = data_two['Volume'].values/MAX_NUM_SHARES
# year data of year-month-day form
dt = data_two['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)

rh = x.fit_transform(diff_h.reshape(-1,1)).reshape(-1)
rl = x.fit_transform(diff_l.reshape(-1,1)).reshape(-1)

Test_Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_open', 'real_close', 'real_high', 'real_low', 'real_vol'])
test = Test_Data(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_open=data['Open'].values, real_close=data_two['Close'].values, real_high=data_two['High'].values, real_low=data_two['Low'].values, real_vol=data['Volume'].values)

# Creating the Standke Policy/Value Network Class

In [27]:
class StandkeExtractor(BaseFeaturesExtractor):
  def __init__(self, observation_space=gym.spaces.Box, features_dim=128):
        super(StandkeExtractor, self).__init__(observation_space, features_dim)
        
        input = observation_space.shape[0]

        # Feature Extractor
        self.cnn = nn.Sequential(
            nn.Conv1d(input, 128, kernel_size=5, padding=3),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=3),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        
  def forward(self, observations):
    return self.cnn(observations)


class StandkeNetwork(nn.Module):
  def __init__(self,feature_dim=1152, last_layer_dim_pi= 64, last_layer_dim_vf= 64):
        super(StandkeNetwork, self).__init__()

        # IMPORTANT:
        # Save output dimensions, used to create the distributions
        self.latent_dim_pi = last_layer_dim_pi
        self.latent_dim_vf = last_layer_dim_vf

         # Policy Network
        self.policy_net = nn.Sequential(
            nn.Linear(feature_dim,last_layer_dim_pi),
            nn.ReLU(),  
        )

         # Value Network
        self.value_net = nn.Sequential(
            nn.Linear(feature_dim, last_layer_dim_vf), 
            nn.ReLU(), 
        )

  def forward_actor(self, features: th.Tensor) -> th.Tensor:
        return self.policy_net(features)

  def forward_critic(self, features: th.Tensor) -> th.Tensor:
        return self.value_net(features)
  

class StandkePolicy(ActorCriticPolicy):
  def __init__(self,observation_space=gym.spaces.Box,action_space=gym.spaces.Box, lr_schedule=constant_fn(0.0003), activation_fn=nn.ReLU,*args,**kwargs):
    
        super(StandkePolicy, self).__init__(observation_space,action_space, lr_schedule, activation_fn,*args,**kwargs)
        
        # non-shared features extractors for the actor and the critic
        self.policy_features_extractor = StandkeExtractor(observation_space)
        self.value_features_extractor = StandkeExtractor(observation_space)
        delattr(self, "features_extractor")  # remove the shared features extractor

        # Disable orthogonal initialization
        self.ortho_init = False

  def _build_mlp_extractor(self):
    self.mlp_extractor = StandkeNetwork()

  def extract_features(self, obs: th.Tensor):
    policy_features = self.policy_features_extractor(obs)
    value_features = self.value_features_extractor(obs)
    return policy_features, value_features
  
  def forward(self, obs: th.Tensor, deterministic: bool = False): 
    policy_features, value_features = self.extract_features(obs)
    latent_pi = self.mlp_extractor.forward_actor(policy_features)
    latent_vf = self.mlp_extractor.forward_critic(value_features)

    # Evaluate the values for the given observations
    values = self.value_net(latent_vf)
    distribution = self._get_action_dist_from_latent(latent_pi)
    actions = distribution.get_actions(deterministic=deterministic)
    log_prob = distribution.log_prob(actions)
    return actions, values, log_prob

  def evaluate_actions(self, obs: th.Tensor, actions: th.Tensor): 
    policy_features, value_features = self.extract_features(obs)
    latent_pi = self.mlp_extractor.forward_actor(policy_features)
    latent_vf = self.mlp_extractor.forward_critic(value_features)
    distribution = self._get_action_dist_from_latent(latent_pi)
    log_prob = distribution.log_prob(actions)
    values = self.value_net(latent_vf)
    return values, log_prob, distribution.entropy()

  def get_distribution(self, obs: th.Tensor):
    policy_features, _ = self.extract_features(obs)
    latent_pi = self.mlp_extractor.forward_actor(policy_features)
    return self._get_action_dist_from_latent(latent_pi)

  def predict_values(self, obs: th.Tensor):
    _, value_features = self.extract_features(obs)
    latent_vf = self.mlp_extractor.forward_critic(value_features)
    return self.value_net(latent_vf)

# Training and Validation Portion

In [28]:
# number of learning steps to train RL model is set to 100K
MAX_STEPS = 1e4
# number of epoch when optimizing the surrogate loss
EPOCHS = 10
# entropy coefficient for the loss calculation
E_COEF = 0.0
# limit the KL divergence between updates 
T_KL = None
# minibatch size
BATCH = 64
# the number of steps to run for each environment per update 
# (i.e. rollout buffer size is n_steps * n_envs where n_envs is 
# number of environment copies running in parallel) NOTE: n_steps * n_envs 
# must be greater than 1 (because of the advantage normalization)
N_STEPS=2033
# value function coefficient for the loss calculation
VF_COEF= 0.5
# the number of parallel environments 
ENV = 1

# create evaluation env that takes in test data for validation
eval_env = DummyVecEnv([lambda: StockTradingEnv(test, random_ofs_on_reset=True)])
# use deterministic actions for evaluation callback
eval_callback = EvalCallback(eval_env, best_model_save_path='/content/drive/MyDrive/RLmodels/bestPPO/',
                             log_path='/content/drive/MyDrive/RLmodels/logs/', eval_freq=MAX_STEPS/10,
                             deterministic=False, render=False)

# create training envs that takes in training data for training
envs =  DummyVecEnv([lambda: StockTradingEnv(train, random_ofs_on_reset=True) for _ in range(0,ENV)])


# additional keyword parameters to pass to model 
policy_kwargs = dict()
# training model using the Standke Policy
model = PPO(StandkePolicy, envs, verbose=1, tensorboard_log="/content/PPO_SPY_tensorboard/", policy_kwargs=policy_kwargs)
check_env(StockTradingEnv(train, random_ofs_on_reset=True))
VecCheckNan(envs, raise_exception=True, check_inf=True)

Using cuda device


<stable_baselines3.common.vec_env.vec_check_nan.VecCheckNan at 0x7fe34e33b3d0>

# General explanation of log output 

As detailed by araffin in his commit [Add explanation of logger output](https://github.com/DLR-RM/stable-baselines3/pull/803/files), for a given log block such as

```
-----------------------------------------
  | eval/                   |             |
  |    mean_ep_length       | 200         |
  |    mean_reward          | -157        |
  | rollout/                |             |
  |    ep_len_mean          | 200         |
  |    ep_rew_mean          | -227        |
  | time/                   |             |
  |    fps                  | 972         |
  |    iterations           | 19          |
  |    time_elapsed         | 80          |
  |    total_timesteps      | 77824       |
  | train/                  |             |
  |    approx_kl            | 0.037781604 |
  |    clip_fraction        | 0.243       |
  |    clip_range           | 0.2         |
  |    entropy_loss         | -1.06       |
  |    explained_variance   | 0.999       |
  |    learning_rate        | 0.001       |
  |    loss                 | 0.245       |
  |    n_updates            | 180         |
  |    policy_gradient_loss | -0.00398    |
  |    std                  | 0.205       |
  |    value_loss           | 0.226       |
  -----------------------------------------
```
``eval/`` 
- ``mean_ep_length``: Mean episode length
- ``mean_reward``: Mean episodic reward (during evaluation)
``rollout/``
- ``ep_len_mean``: Mean episode length (averaged over 100 episodes)
- ``ep_rew_mean``: Mean episodic training reward (averaged over 100 episodes)
``time/``
- ``episodes``: Total number of episodes
- ``fps``: Number of frames per seconds (includes time taken by gradient update)
- ``iterations``: Number of iterations (data collection + policy update for A2C/PPO)
- ``time_elapsed``: Time in seconds since the beginning of training
- ``total_timesteps``: Total number of timesteps (steps in the environments)
``train/``
- ``entropy_loss``: Mean value of the entropy loss (negative of the average policy entropy). 
  * ⚠**According to the formula as detailed [model](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91) on line 91, if ent_coef is 0 this term should not matter which is the default hyperparamter setting; difficult to interpret for this env due to it being negative**⚠
  * **Furthermore according to [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) which cites [Andrychowicz, et al. (2021)](https://openreview.net/forum?id=nIAxjsniDzg) overall find no evidence that the entropy term improves performance on continuous control environments (decision C13, figure 76 and 77)**
- ``clip_fraction``: mean fraction of surrogate loss that was clipped (above clip_range threshold) for PPO.
- ``clip_range``: Current value of the clipping factor for the surrogate loss of PPO
- ``entropy_loss``: Mean value of the entropy loss (negative of the average policy entropy)
    *  want the entropy to be decreasing slowly and smoothly over the course of training, as the agent trades exploration in favor of exploitation.
- ``learning_rate``: Current learning rate value
- ``n_updates``: Number of gradient updates applied so far
- ``policy_gradient_loss``: Current value of the policy gradient loss (its value does not have much meaning)(lol I did not say this 😸, but yeah basically useless don't pay attention)
- ``std``: Current standard deviation of the noise when using generalized State-Dependent Exploration (gSDE) (which by default is not used)

# Important Training Metrics to Focus On!!!! ✅✅✅✅✅✅✅✅✅
- ``approx_kl``: approximate mean KL divergence between old and new policy (for PPO), it is an estimation of how much change happened in the update (i.e. information gain or loss)
  * **Want this value to SMOOTHLY DECREASE during training and be as close as possible to 0**
  * **Should be DECREASING**
- ``explained_variance``: Fraction of the return variance explained by the value function. This metric calculates how good the value function is as a predicator of future rewards
  * **Want this value to be as close as possible to 1 (i.e.perfect predictions) during training rather than less than or equal to 0 (i.e. no predictive power)**
  * **Should be INCREASING**
- ``loss``: called total loss, but is actually not a loss function, but an objective function 
  * **Want to MAXIMIZE this during training** 
  * **Should be INCREASING**
- ``value_loss``: error that value function is incurring 
  *   **Want to MINIMIZE this during training to 0 (though as discussed this isn't always possible due to randomness)**
  * **Should be DECREASING**





In [29]:
model.learn(total_timesteps=MAX_STEPS, callback=eval_callback)

Logging to /content/PPO_SPY_tensorboard/PPO_4
Eval num_timesteps=1000, episode_reward=403.94 +/- 188.15
Episode length: 403.20 +/- 187.85
---------------------------------
| eval/              |          |
|    mean_ep_length  | 403      |
|    mean_reward     | 404      |
| time/              |          |
|    total_timesteps | 1000     |
---------------------------------
New best mean reward!
Eval num_timesteps=2000, episode_reward=462.10 +/- 91.89
Episode length: 461.20 +/- 91.90
---------------------------------
| eval/              |          |
|    mean_ep_length  | 461      |
|    mean_reward     | 462      |
| time/              |          |
|    total_timesteps | 2000     |
---------------------------------
New best mean reward!
-----------------------------
| time/              |      |
|    fps             | 139  |
|    iterations      | 1    |
|    time_elapsed    | 14   |
|    total_timesteps | 2048 |
-----------------------------
Eval num_timesteps=3000, episode_reward=40

<stable_baselines3.ppo.ppo.PPO at 0x7fe34e4ccbd0>

# Prediction and Rendering Environment Portion

In [None]:
# model = PPO.load("/content/drive/MyDrive/RLmodels/bestPPO/best_model.zip")

env = StockTradingEnv(test, random_ofs_on_reset=False)
obs = env.reset()
for i in range(len(test.date)):
  action, _states = model.predict(obs, deterministic=False)
  obs, rewards, done, info = env.step(action)
  env.render()
  if done:
    break

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
NetWorth: 251939.95957214403

Reward: 250.3020202020202

Amount Held: 922.3904038210635

Amount Sold: 1088.9323610302254

Step: 256

Date: 2018-11-19

NetWorth: 252594.856758857

Reward: 251.3020202020202

Amount Held: 922.3904038210635

Amount Sold: 1088.9323610302254

Step: 257

Date: 2018-11-20

NetWorth: 248324.18918916548

Reward: 252.3020202020202

Amount Held: 922.3904038210635

Amount Sold: 1088.9323610302254

Step: 258

Date: 2018-11-21

NetWorth: 243719.97049494315

Reward: 253.3020202020202

Amount Held: 922.7960710629768

Amount Sold: 1088.9323610302254

Step: 259

Date: 2018-11-23

NetWorth: 244550.4869588998

Reward: 254.3020202020202

Amount Held: 922.7960710629768

Amount Sold: 1088.9323610302254

Step: 260

Date: 2018-11-26

NetWorth: 242917.13791311835

Reward: 255.3020202020202

Amount Held: 922.7960710629768

Amount Sold: 1088.9323610302254

Step: 261

Date: 2018-11-27

NetWorth: 246839.02121513602

Re

In [None]:
# # taken from https://stackoverflow.com/questions/5967500/how-to-correctly-sort-a-string-with-a-number-inside

# def atoi(text):
#     return int(text) if text.isdigit() else text

# def natural_keys(text):
#     '''
#     alist.sort(key=natural_keys) sorts in human order
#     http://nedbatchelder.com/blog/200712/human_sorting.html
#     (See Toothy's implementation in the comments)
#     '''
#     return [ atoi(c) for c in re.split(r'(\d+)', text) ]

# list_of_files = [img for img in os.listdir('/content') if img.endswith(".png")]
# list_of_files.sort(key=natural_keys)

In [None]:
# # taken from https://stackoverflow.com/questions/44947505/how-to-make-a-movie-out-of-images-in-python
# clip = moviepy.video.io.ImageSequenceClip.ImageSequenceClip(list_of_files, fps=fps)
# clip.write_videofile('agent_trading.mp4')

[MoviePy] >>>> Building video agent_trading.mp4
[MoviePy] Writing video agent_trading.mp4


100%|█████████▉| 241/242 [00:02<00:00, 95.18it/s]


[MoviePy] Done.
[MoviePy] >>>> Video ready: agent_trading.mp4 



In [None]:
# # taken from https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t#scrollTo=8nj5sjsk15IT
# def show_video():
#   mp4list = glob.glob('agent_trading.mp4')
#   if len(mp4list) > 0:
#     mp4 = mp4list[0]
#     video = io.open(mp4, 'r+b').read()
#     encoded = base64.b64encode(video)
#     ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
#                 loop controls style="height: 400px;">
#                 <source src="data:video/mp4;base64,{0}" type="video/mp4" />
#              </video>'''.format(encoded.decode('ascii'))))
#   else: 
#     print("Could not find video")

In [None]:
# show_video()

In [None]:
# !rm -r *.png

In [None]:
%tensorflow_version 2
%load_ext tensorboard
%tensorboard --logdir /content/PPO_SPY_tensorboard/ 