<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/ThirdStockEnivornment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Third Stock Trading Environment


This third stock trading environment is based on Adam King's article [Creating Bitcoin trading bots don’t lose money](https://medium.com/towards-data-science/creating-bitcoin-trading-bots-that-dont-lose-money-2e7165fb0b29). Similar to the first stock trading environment, which is based on Maxim Lapan's implementation in chapter eight of his book [Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998), the agent trades the [SPY ETF](https://www.etf.com/SPY?L=1). In this environment, however, the agent's action has two parts: it not only decides whether to buy, sell, or hold shares, but also determines the amount to buy/sell, ranging from 1 to 100 (which is converted into percentage form, i.e. 1/100 = 1%, 100/100 = 100%) of the balance in its [trading account](https://www.investopedia.com/terms/t/tradingaccount.asp#:~:text=A%20trading%20account%20is%20an,margin%20requirements%20set%20by%20FINRA.).
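As a rough illustration of the two-part action, the sketch below mirrors the thresholds used in the environment's `_take_action` method further down: a first component below 1 means buy, below 2 means sell, and anything else means hold, while the second component is the fraction to trade. The helper name `decode_action` is hypothetical and is not part of the notebook.

```python
def decode_action(action):
    """Map a two-component action to a (label, fraction) pair.

    Hypothetical helper: the thresholds mirror the ones in the
    environment's _take_action method.
    """
    action_type, amount = float(action[0]), float(action[1])
    if action_type < 1:
        return "buy", amount
    if action_type < 2:
        return "sell", amount
    return "hold", 0.0


print(decode_action([0.4, 0.25]))  # buy with 25% of the balance
print(decode_action([1.7, 1.0]))   # sell 100% of the shares held
print(decode_action([2.5, 0.6]))   # hold, the amount is ignored
```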


In [1]:
# ignore warning messages because they are annoying lol
import warnings
warnings.filterwarnings('ignore')

# Installing Necessary Package for Training the Trading Agent

To train the Trading Agent the package [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/index.html) was used. As stated in the docs: 
> Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines. It stems from the paper [Stable-Baselines3: Reliable Reinforcement Learning Implementations](https://jmlr.org/papers/volume22/20-1364/20-1364.pdf).
The algorithms in this package will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

---
## Proximal Policy Optimization(PPO):

Because the agent executes continuous actions in this environment, the Proximal Policy Optimization (PPO) algorithm was chosen. As detailed by the authors of [PPO](https://arxiv.org/pdf/1707.06347.pdf):


> We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a “surrogate” objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically).


PPO uses the following novel objective function:

$L^{CLIP}(\theta)=\hat{E}_t\left[\min\left(r_{t}(\theta)\hat{A}_t,\ \mathrm{clip}(r_{t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$

*  $\theta$ is the policy parameter
*  $\hat{E}_t$ denotes the empirical expectation over timesteps
*  $r_{t}$ is the ratio of the probability under the new and old policies, respectively
*  $\hat{A}_t$ is the estimated advantage at time t
*  $\epsilon$ is the clipping hyperparameter, usually 0.1 or 0.2
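
To make the clipping concrete, here is a minimal NumPy sketch of the objective for a handful of made-up ratios and advantages. The function name `clipped_surrogate` and the sample numbers are illustrative assumptions; this is not how Stable-Baselines3 implements its loss internally.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Empirical mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))


# ratio 1.5 with a positive advantage: the clipped term 1.2 * 2 is the smaller one
# ratio 0.5 with a negative advantage: the clipped term 0.8 * -1 is the smaller one
value = clipped_surrogate([1.5, 0.5], [2.0, -1.0])
print(value)
```

In both samples the clipped term is selected, so a policy update gains nothing from pushing the ratio further outside the interval $[1-\epsilon, 1+\epsilon]$.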


As detailed by the authors [openAI](https://openai.com/blog/openai-baselines-ppo/#ppo)


> This objective implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent, and simplifies the algorithm by removing the KL penalty and need to make adaptive updates. In tests, this algorithm has displayed the best performance on continuous control tasks and almost matches ACER’s performance on Atari, despite being far simpler to implement


  









In [None]:
!pip install stable-baselines3[extra]
!pip install empyrical
!pip install optuna
!pip install --upgrade importlib-metadata==4.13.0

In [3]:
import random
import gym 
from gym import spaces
from gym.utils import seeding
import pandas as pd
import numpy as np
import json
import datetime as dt
import optuna
from typing import Callable, Dict, List, Optional, Tuple, Type, Union
from stable_baselines3 import PPO
from stable_baselines3.common.utils import constant_fn
from stable_baselines3.common.policies import ActorCriticPolicy
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
from stable_baselines3.common.vec_env import DummyVecEnv, VecCheckNan
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import BaseCallback
from empyrical import sortino_ratio, calmar_ratio, omega_ratio
import sqlite3
from sqlite3 import Error
import torch as th
import torch.nn as nn
import collections
import datetime
from sklearn import preprocessing
import math
import os
import csv
from csv import DictWriter

# Third Stock Environment



In [4]:
# stock environment parameters
MAX_ACCOUNT_BALANCE = 2147483647
MAX_NUM_SHARES = 2147483647
MAX_SHARE_PRICE = 4294967295
LOOKBACK_WINDOW_SIZE = 10
MAX_STEPS = 20000
INITIAL_ACCOUNT_BALANCE = 10000
# default fraction of the trade value the trading agent pays the broker when
# buying/selling; 0.001 corresponds to 0.1% (i.e. very reasonable)
DA_COMMISION = 0.001

In [5]:
# Stock/ETF Trading Environment
class StockTradingEnv(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, data, reward_func='BalenceReward', random=True):
        super(StockTradingEnv, self).__init__()
        self.data = data
        self.scale = preprocessing.MinMaxScaler()
        self.random_ofs_on_reset = random
        self.reward_func = reward_func
        self.bars_count = LOOKBACK_WINDOW_SIZE
        self.commission = DA_COMMISION
        self.hold= False

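        # Actions have the format [action type, amount]: an action type below 1
        # buys, below 2 sells, and otherwise holds; amount is the fraction to trade
        self.action_space = spaces.Box(low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float32)

        # Observations hold the volume, high, low, open, account history, and close
        # for the last bars_count bars, i.e. a 2D array of shape (6, bars_count)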
        self.observation_space = spaces.Box(
            low=0, high=1, shape=self.shape, dtype=np.float32)
        
        self.seed()

    def reset(self):
      # random offset portion 
      bars = self.bars_count
      if self.random_ofs_on_reset:
        offset = self.np_random.choice(self.data.high.shape[0]-bars*10)+bars
      else:
        offset = bars
      self._reset(offset)
      return self._next_observation()

    def _reset(self, offset):
      self.trades = []
      self.balance = INITIAL_ACCOUNT_BALANCE
      self.netWorth = INITIAL_ACCOUNT_BALANCE
      self.max_net_worth = INITIAL_ACCOUNT_BALANCE
      self.standkeMaxBenchShares = 0
      self.shares_held  = 0
      self._offset = offset
      # setting account history portion
      self.account_history = np.repeat([[self.netWorth/MAX_ACCOUNT_BALANCE]], LOOKBACK_WINDOW_SIZE, axis=1)

    # shape of observation space is 2D
    @property
    def shape(self):
      return (6, self.bars_count)

    def _next_observation(self):
      res = np.zeros(shape=(6, self.bars_count), dtype=np.float32)
      ofs = self.bars_count-1
      res[0] = self.data.volume[self._offset-ofs:self._offset+1]
      res[1] = self.data.high[self._offset-ofs:self._offset+1]
      res[2] = self.data.low[self._offset-ofs:self._offset+1]
      res[3] = self.data.open[self._offset-ofs:self._offset+1]
      res[4] = self.account_history[0][-self.bars_count:]
      res[5] = self.data.close[self._offset-ofs:self._offset+1]
      res = np.float32(res)
      return res
       
    def _take_action(self, action):
      reward = 0
      current_price = self._cur_close()
      action_type = action[0]
      amount = action[1]
      
      shares_bought = 0
      shares_sold = 0
      additional_cost = 0
      sales = 0


      if action_type < 1 :
        # Buy amount % of balance in shares
        total_possible = self.balance / (current_price * (1+self.commission))
        shares_bought = total_possible * amount
        additional_cost = shares_bought * current_price * (1+self.commission)
        self.balance -= additional_cost
        self.standkeMaxBenchShares += shares_bought
        self.shares_held += shares_bought
        
        
        # visualization portion
        if shares_bought > 0:
          self.trades.append({'step': self._offset, 'shares': shares_bought, 
                              'total': additional_cost, 'type': "buy"})
          
          
      elif action_type < 2:
        # Sell amount % of shares held
        shares_sold = self.shares_held * amount  
        sales = shares_sold * current_price * (1 - self.commission)
        self.balance += sales
        self.standkeMaxBenchShares -= shares_sold
        self.shares_held -= shares_sold
        

        # visualization portion
        if shares_sold > 0:
          self.trades.append({'step': self._offset, 'shares': -shares_sold, 
                                  'total': shares_sold * current_price, 'type': "sell"})  
          
      
      self.netWorth = self.balance + self.shares_held * current_price
      
      if self.netWorth > self.max_net_worth:
        self.max_net_worth = self.netWorth

      # updating account history
      self.account_history = np.append(self.account_history, [[self.netWorth/MAX_ACCOUNT_BALANCE]], axis=1)
      # reward Calculations
      returns = self.account_history[0][-self.bars_count:]
      if self.reward_func == 'BalenceReward':
        delay_modifier = (self._offset / MAX_STEPS)
        reward = self.balance * delay_modifier
      elif self.reward_func == 'sortinoRewardRatio':
        ratio = sortino_ratio(returns, period="daily")
        reward= ratio * self.balance
      elif self.reward_func == 'calmarRewardRatio':
        ratio = calmar_ratio(returns, period="daily")
        reward= ratio * self.balance
      elif self.reward_func == 'omegaRewardRatio':
        ratio = omega_ratio(returns,  annualization=self.bars_count)
        reward= ratio * self.balance
      elif self.reward_func == 'StandkeCurrentValueReward':
        prev_net = returns[-2]
        current_net = returns[-1]
        ratio = current_net-prev_net
        reward = ratio * self.balance
      elif self.reward_func == 'StandkeSmallDrawDownReward':
        mx = np.max(returns)
        mi = np.min(returns)
        # parenthesize the difference so the drawdown is (max - min) / max
        ratio = round(abs((mx-mi)/mx), 1)
        reward = ratio * self.balance
      elif self.reward_func == 'StandkeSumofDifferenceReward':
        ratio = np.sum(np.diff(returns))
        reward = ratio * self.balance
      else:
        ratio = np.mean(returns)
        reward = ratio * self.balance
      return reward if abs(reward) != np.inf and not np.isnan(reward) else 0

      
    def _cur_close(self):
      """
      Calculate real close price for the current bar
      """
      return self.data.real_close[self._offset]

    def step(self, action):
      # Execute one time step within the environment
      reward = self._take_action(action)
    
      self._offset += 1

      if self._offset >= self.data.close.shape[0]-1 or self.netWorth <= 0 or self.netWorth>=MAX_ACCOUNT_BALANCE:
        done=True
      else:
        done=False
  
      obs = self._next_observation()

      info = {"Net Worth":self.netWorth, "reward": reward}
      
      return obs, reward, done, info

    def _render_to_file(self, filename='results.csv'):
      csv_columns = ['Date','Net_Worth','Balence', 'StandkeMaxBenchShares']
      dict_data = {'Date':self.data.date[self._offset], 'Net_Worth':self.netWorth, 'Balence':self.balance, 'StandkeMaxBenchShares':self.standkeMaxBenchShares}
      # the with-block closes the file, so no explicit close() is needed
      with open(filename, 'a+', newline='') as f:
        writer = DictWriter(f, fieldnames=csv_columns)
        writer.writerow(dict_data)
 
    def render(self, mode='file', title="Agent's Trading Screen", **kwargs):
      # Render the environment to the screen
      if mode == 'file':
        self._render_to_file()


    def seed(self, seed=None):
      self.np_random, seed1 = seeding.np_random(seed)
      seed2 = seeding.hash_seed(seed1+1) % 2**33
      return [seed1, seed2]


# Data Preprocessing

1.   First the data is made [stationary](https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/) to remove any trends or seasonality associated with the time series data
2.   Then the price data is converted into relative prices to model the relative change rather than the absolute change
3. Lastly the data is normalized using sklearn's [min-max scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) so as to fit within the environment's observation space of [0,1]
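
The three steps above can be sketched on a toy series as follows. The price values are made up, and `min_max` is a plain NumPy stand-in for sklearn's `MinMaxScaler`, used here only to keep the example self-contained; the helper names are assumptions, not the notebook's own functions.

```python
import numpy as np


def log_difference(prices, interval=1):
    """Log-difference a price series (step 1)."""
    logs = np.log(np.asarray(prices, dtype=float))
    return logs[interval:] - logs[:-interval]


def min_max(values):
    """Rescale values to [0, 1]; a NumPy stand-in for MinMaxScaler (step 3)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


# toy bars, not real SPY data
high = [101.0, 103.0, 102.0, 106.0]
low = [99.0, 100.0, 100.0, 101.0]
close = [100.0, 102.0, 101.0, 105.0]

d_high, d_low, d_close = (log_difference(s) for s in (high, low, close))
# step 2: position of the close change relative to the high/low changes
relative_close = (d_close - d_low) / (d_high - d_low)
scaled_close = min_max(relative_close)
print(scaled_close)
```

Note that differencing shortens each price series by one element, so a four-bar input yields three preprocessed values.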


In [6]:
# using sklearn's min-max scaler for the relative high and low
x=preprocessing.MinMaxScaler()

# create a log-differenced series as done in step 1 (see link for more info)
def difference(dataset, interval=1):
    # work on positional values so the result does not depend on the pandas index
    values = np.log(np.asarray(dataset, dtype=float))
    return list(values[interval:] - values[:-interval])
 
# training data
df = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/archive/Data/ETFs/spy.us.txt')
df = df.sort_values('Date')
data=df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data['Open'], 1))
diff_h = np.array(difference(data['High'], 1))
diff_l = np.array(difference(data['Low'], 1))
diff_c = np.array(difference(data['Close'], 1))
# volume data
vol = data['Volume'].values/MAX_NUM_SHARES
# dates in year-month-day form
dt = data['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)

rh = x.fit_transform(diff_h.reshape(-1,1)).reshape(-1)
rl = x.fit_transform(diff_l.reshape(-1,1)).reshape(-1)

Train_Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_open',  'real_close', 'real_high', 'real_low', 'real_vol'])
train = Train_Data(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_open=data['Open'].values, real_close=data['Close'].values, real_high=data['High'].values, real_low=data['Low'].values, real_vol=data['Volume'].values)

In [7]:
# Testing data
test = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/test.csv')
t_df = test.sort_values('Date')
data_two=t_df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# making OHLC data stationary before calculating relative and normalizing 
diff_o = np.array(difference(data_two['Open'], 1))
diff_h = np.array(difference(data_two['High'], 1))
diff_l = np.array(difference(data_two['Low'], 1))
diff_c = np.array(difference(data_two['Close'], 1))
# volume data
vol = data_two['Volume'].values/MAX_NUM_SHARES
# dates in year-month-day form
dt = data_two['Date'].array
# calculating relative prices and normalizing data
o =  (diff_o-diff_l)/(diff_h-diff_l)
o =  x.fit_transform(o.reshape(-1,1)).reshape(-1)
rc = (diff_c-diff_l)/(diff_h-diff_l)
rc = x.fit_transform(rc.reshape(-1,1)).reshape(-1)

rh = x.fit_transform(diff_h.reshape(-1,1)).reshape(-1)
rl = x.fit_transform(diff_l.reshape(-1,1)).reshape(-1)

Test_Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume', 'real_open', 'real_close', 'real_high', 'real_low', 'real_vol'])
# real_open and real_vol must come from the test frame (data_two), not the training frame
test = Test_Data(date=dt,high=rh, low=rl, close=rc, open=o, volume=vol, real_open=data_two['Open'].values, real_close=data_two['Close'].values, real_high=data_two['High'].values, real_low=data_two['Low'].values, real_vol=data_two['Volume'].values)

# Creating Separate Policy/Value Network Class using a 1D CNN Feature Extractor

Stable-Baselines3 lists the blog post [37 implementation details of PPO](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/), which breaks down the different implementation details of PPO. Furthermore, as the authors of [What Matters for On-Policy Deep Actor-Critic Methods? A Large-Scale Study](https://openreview.net/pdf?id=nIAxjsniDzg) detail:


> Separate value and policy networks (C47) appear to lead to better performance on four out of five environments (Fig. 15). To avoid analyzing the other choices based on bad models, we thus focus for the rest of this experiment only on agents with separate value and policy networks. Regarding network sizes, the optimal width of the policy MLP depends on the complexity of the environment (Fig. 18) and too low or too high values can cause significant drop in performance while for the value function there seems to be no downside in using wider networks (Fig. 21). Moreover, on some environments it is beneficial to make the value network wider than the policy one, e.g. on HalfCheetah the best results are achieved with 16 − 32 units per layer in the policy network and 256 in the value network. Two hidden layers appear to work well for policy (Fig. 22) and value networks (Fig. 20) in all tested environments. As for activation functions, we observe that tanh activations perform best and relu worst.

With this statement in mind, I decided to implement a separate-network architecture for PPO.

## 1D CNN Feature Extractor

I decided to use two separate 1D CNN feature extractors for the policy and value networks. I decided upon the following architecture for the policy feature extractor:

```
self.cnn = nn.Sequential(
            nn.Conv1d(input, 16, kernel_size=2),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=4),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),
            nn.Flatten(),
        )
```
And I decided upon the following architecture for the Value feature extractor:
```
self.cnn = nn.Sequential(
            nn.Conv1d(input,128,kernel_size=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=4),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),
            nn.Flatten(),
        )

```
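
As a sanity check on the extractor output size, the sketch below computes the flattened feature length by hand for the 10-bar lookback window (`LOOKBACK_WINDOW_SIZE = 10`). It assumes the PyTorch defaults of no padding, no dilation, a convolution stride of 1, and a pooling stride equal to the pooling kernel size; the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded, undilated 1D convolution or pooling layer."""
    return (length - kernel_size) // stride + 1


def flattened_size(window, out_channels):
    """Flattened feature size after Conv1d(k=2) -> Conv1d(k=4) -> MaxPool1d(4)."""
    length = conv1d_out_len(window, kernel_size=2)            # 10 -> 9
    length = conv1d_out_len(length, kernel_size=4)            # 9 -> 6
    length = conv1d_out_len(length, kernel_size=4, stride=4)  # 6 -> 1 (pool stride = kernel)
    return out_channels * length


print(flattened_size(window=10, out_channels=256))  # value extractor
```

With a 10-bar window the temporal dimension collapses to a single step, so the flattened size equals the number of output channels of the last convolution. For the value extractor that is 256, which matches the first `nn.Linear(256, 256)` layer of the value network.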
## Policy Network

I decided on the following architecture for the Policy Network:

```
  self.policy_net = nn.Sequential(
            layer_init(nn.Linear(feature_dim, 32)),
            nn.Tanh(),
            layer_init(nn.Linear(32, last_layer_dim_pi), std=0.01),
            nn.Tanh(),  
        )
```

## Value Network 

And I decided on the following architecture for the Value Network:

```
self.value_net = nn.Sequential(
           layer_init(nn.Linear(256, 256)),
           nn.Tanh(),
           layer_init(nn.Linear(256, 128)),
           nn.Tanh(),
           layer_init(nn.Linear(128,32)),
           nn.Tanh(),
           layer_init(nn.Linear(32,last_layer_dim_vf)),
           nn.Tanh(),
           )
```



In [8]:
def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    th.nn.init.orthogonal_(layer.weight, std)
    th.nn.init.constant_(layer.bias, bias_const)
    return layer

class StandkePolicyExtractor(BaseFeaturesExtractor):
  def __init__(self, observation_space=gym.spaces.Box, features_dim=128):
        super(StandkePolicyExtractor, self).__init__(observation_space, features_dim)
        input = observation_space.shape[0]
        # Feature Extractor
        self.cnn = nn.Sequential(
            nn.Conv1d(input,16,kernel_size=2),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=4),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]
  
  def forward(self, observations):
    return self.cnn(observations)

class StandkeValueExtractor(BaseFeaturesExtractor):
  def __init__(self, observation_space=gym.spaces.Box, features_dim=128):
        super(StandkeValueExtractor, self).__init__(observation_space, features_dim)
        input = observation_space.shape[0]
        # Feature Extractor
        self.cnn = nn.Sequential(
            nn.Conv1d(input,128,kernel_size=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=4),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            n_flatten = self.cnn(
                th.as_tensor(observation_space.sample()[None]).float()
            ).shape[1]

        
  def forward(self, observations):
    return self.cnn(observations)



class StandkeNetwork(nn.Module):
  def __init__(self,feature_dim=32, last_layer_dim_pi=2, last_layer_dim_vf=1):
        super(StandkeNetwork, self).__init__()

        # IMPORTANT:
        # Save output dimensions, used to create the distributions
        self.latent_dim_pi = last_layer_dim_pi
        self.latent_dim_vf = last_layer_dim_vf

         # Policy Network
        self.policy_net = nn.Sequential(
            layer_init(nn.Linear(feature_dim, 32)),
            nn.Tanh(),
            layer_init(nn.Linear(32, last_layer_dim_pi), std=0.01),
            nn.Tanh(),  
        )


         # Value Network
        self.value_net = nn.Sequential(
           layer_init(nn.Linear(256, 256)),
           nn.Tanh(),
           layer_init(nn.Linear(256, 128)),
           nn.Tanh(),
           layer_init(nn.Linear(128,32)),
           nn.Tanh(),
           layer_init(nn.Linear(32,last_layer_dim_vf)),
           nn.Tanh(),
           )

  def forward_actor(self, features: th.Tensor):
    return self.policy_net(features)

  def forward_critic(self, features: th.Tensor):
    return self.value_net(features)
  

class StandkePolicy(ActorCriticPolicy):
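  def __init__(self, observation_space=gym.spaces.Box,action_space=gym.spaces.Box, lr_schedule=constant_fn(0.0003), activation_fn=nn.Tanh,*args,**kwargs):
        # pass activation_fn by keyword; the fourth positional parameter of ActorCriticPolicy is net_arch
        super(StandkePolicy, self).__init__(observation_space, action_space, lr_schedule, *args, activation_fn=activation_fn, **kwargs)
        # non-shared features extractors 
        self.policy_features_extractor = StandkePolicyExtractor(observation_space)
        self.value_features_extractor = StandkeValueExtractor(observation_space)
        delattr(self, "features_extractor")  # remove the shared features extractor
        # rebuild the optimizer so that it also covers the two extractors created above
        self.optimizer = self.optimizer_class(self.parameters(), lr=lr_schedule(1), **self.optimizer_kwargs)
        # orthogonal initialization
        self.ortho_init = False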

  def _build_mlp_extractor(self):
    self.mlp_extractor = StandkeNetwork()

  def extract_features(self, obs: th.Tensor):
    policy_features = self.policy_features_extractor(obs)
    value_features = self.value_features_extractor(obs)
    return policy_features, value_features
  
  def forward(self, obs: th.Tensor, deterministic=False): 
    policy_features, value_features = self.extract_features(obs)
    mu_pi = self.mlp_extractor.forward_actor(policy_features)
    latent_vf = self.mlp_extractor.forward_critic(value_features)
    # Evaluate the values for the given observations
    distribution = self._get_action_dist_from_latent(mu_pi)
    actions = distribution.get_actions(deterministic=deterministic)
    log_prob = distribution.log_prob(actions)
    vf = latent_vf
    return actions, vf, log_prob

  def evaluate_actions(self, obs: th.Tensor, actions: th.Tensor): 
    policy_features, value_features = self.extract_features(obs)
    mu_pi = self.mlp_extractor.forward_actor(policy_features)
    latent_vf = self.mlp_extractor.forward_critic(value_features)
    distribution = self._get_action_dist_from_latent(mu_pi)
    # score the actions passed in (from the rollout buffer) rather than resampling new ones
    log_prob = distribution.log_prob(actions)
    vf = latent_vf
    return vf, log_prob, distribution.entropy()

  def get_distribution(self, obs: th.Tensor):
    policy_features, _ = self.extract_features(obs)
    latent_pi = self.mlp_extractor.forward_actor(policy_features)
    return self._get_action_dist_from_latent(latent_pi)

  def predict_values(self, obs: th.Tensor):
    _, value_features = self.extract_features(obs)
    latent_vf = self.mlp_extractor.forward_critic(value_features)
    return latent_vf

In [9]:
class TensorboardCallback(BaseCallback):
    def __init__(self, verbose=0):
        super(TensorboardCallback, self).__init__(verbose)
        self.mean_reward = []

    def _on_step(self) -> bool:
      self.mean_reward.append(self.locals['infos'][0]['reward'])
      if (self.num_timesteps % 10000 == 0):
        self.logger.record('Net Worth', self.locals['infos'][0]['Net Worth'])
        self.logger.record('Mean Reward', np.mean(self.mean_reward))
        self.mean_reward.clear()
      return True

# Hyperparameter Tuning

Following the optimization scheme outlined by Adam King in his article [Optimizing deep learning trading bots using state-of-the-art techniques](https://towardsdatascience.com/using-reinforcement-learning-to-trade-bitcoin-for-massive-profit-b69d0e8f583b), Bayesian optimization was done using [Optuna](https://optuna.org/).
After doing a categorical trial on the different rewards I have implemented, namely: 
* BalenceReward: simple reward created by Adam King that multiplies the balance by a delay modifier, which is the current offset (step) in the environment divided by `MAX_STEPS`
* [sortinoRewardRatio](https://www.investopedia.com/terms/s/sortinoratio.asp): this ratio is multiplied by the balance
* [calmarRewardRatio](https://www.investopedia.com/terms/c/calmarratio.asp): this ratio is multiplied by the balance
* [omegaRewardRatio](https://www.wallstreetmojo.com/omega-ratio/): this ratio is multiplied by the balance
* StandkeCurrentValueReward: simple reward I created that is the difference between the current trading day's net worth and the previous trading day's net worth, multiplied by the balance
* StandkeSmallDrawDownReward: reward I created that takes the difference between the maximum and minimum net worth of the past 10 trading days, divides it by the maximum net worth of the past 10 trading days, and multiplies the result by the balance
* StandkeSumofDifferenceReward: simple reward I created that sums the day-to-day differences in net worth over the past 10 trading days before multiplying the sum by the balance
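
For a feel of the three Standke ratios, here is a small NumPy sketch on a made-up net-worth window. It shows only the ratio part of each reward; the environment additionally multiplies each ratio by the balance and works on net worth scaled by `MAX_ACCOUNT_BALANCE`. The drawdown line follows the verbal description, (max - min) / max, which is an assumption about the intended formula.

```python
import numpy as np

# toy net-worth window, not real results
window = np.array([100.0, 110.0, 105.0, 120.0])

# StandkeCurrentValueReward: latest net worth minus the previous one
current_value = window[-1] - window[-2]

# StandkeSumofDifferenceReward: sum of the step-to-step differences
sum_of_differences = np.sum(np.diff(window))

# StandkeSmallDrawDownReward, read as (max - min) / max rounded to one decimal
small_drawdown = round(abs((window.max() - window.min()) / window.max()), 1)

print(current_value, sum_of_differences, small_drawdown)
```

The sum of differences telescopes to the last value minus the first, so it rewards the net change over the whole window regardless of the path taken in between.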

BalenceReward was chosen as the default reward scheme for testing the hyperparameters outlined by Adam King and the following hyperparameters:

!!!Still doing!!!!!!

*  gamma: 
*  clip_range: 
*  clip_range_vf:
*  ent_coef:
*  vf_coef:
*  target_kl: 

 



In [None]:
def objective_fn(trial):
    # env_params = optimize_envs(trial) # just using default reward, test later
    agent_params = optimize_ppo(trial)
    
    train_env = DummyVecEnv([lambda: StockTradingEnv(train, random=False)])
    validation_env = DummyVecEnv([lambda: StockTradingEnv(test, random=False)])
    model = PPO(StandkePolicy, train_env, **agent_params)
    
    model.learn(len(train_env.get_attr('data')[0].date)) # trains based on length of data 
                                                         # approx 3000
    rewards, done = [], False
    obs = validation_env.reset()
    for i in range(len(validation_env.get_attr('data')[0].date)):
        action, _ = model.predict(obs)
        obs, reward, done, _ = validation_env.step(action)
        if done:
          break
        if abs(reward) != np.inf and not np.isnan(reward):
          rewards.append(reward)
        else:
          pass
    return np.mean(rewards)

def optimize_ppo(trial):
    return {
        'gamma': trial.suggest_float('gamma1', 0.1, 0.9, log=True),
        'clip_range': trial.suggest_float('clip_range1', 0.1, 0.5, log=True),
        'clip_range_vf': trial.suggest_float('clip_range_vf2', 0.1, 0.5, log=True),
        'ent_coef':trial.suggest_float('ent_coef3', 0.1, 0.9, log=True), 
        'vf_coef':trial.suggest_float('vf_coef4', 0.1, 0.9, log=True),
        'target_kl': trial.suggest_float('target_kl5', 0.01, 0.05, log=True),
    }

def optimize_envs(trial):
    return {'reward_func': trial.suggest_categorical('reward_func', ['BalenceReward', 'sortinoRewardRatio', 'calmarRewardRatio', 'omegaRewardRatio', 'StandkeCurrentValueReward', 'StandkeSmallDrawDownReward', 'StandkeSumofDifferenceReward', 'Mean'])}


study = optuna.create_study(study_name='StockEnvPPO_Parms', direction="maximize", storage="sqlite:///PPOhyper.db", load_if_exists=True)
study.optimize(objective_fn, n_trials=5000, show_progress_bar=True)

In [None]:
# loading the hyperparameters for training
study = optuna.load_study(study_name='StockEnvPPO_Parms', storage='sqlite:///PPOhyper.db')
params = study.best_trial.params
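
The names stored in the study carry the numeric suffixes used in `optimize_ppo` (for example `gamma1` or `clip_range_vf2`), so they cannot be passed to `PPO` as keyword arguments directly. A minimal sketch of one way to map them back, assuming every stored name is a PPO argument followed by trailing digits; `to_ppo_kwargs` is a hypothetical helper and the values below are made up:

```python
import re


def to_ppo_kwargs(best_params):
    """Strip the trailing digits that optimize_ppo appends to each Optuna name.

    Hypothetical helper, not part of the notebook or of Optuna.
    """
    return {re.sub(r"\d+$", "", name): value for name, value in best_params.items()}


# made-up values standing in for study.best_trial.params
example = {"gamma1": 0.85, "clip_range1": 0.2, "clip_range_vf2": 0.3, "target_kl5": 0.02}
print(to_ppo_kwargs(example))
```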

# Training and Validation Portion

In [None]:
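# number of learning steps to train the RL model is set to 100K; note that this
# rebinds the MAX_STEPS global read by the environment's BalenceReward delay modifier
MAX_STEPS = 100_000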
# the number of parallel environments for training  
ENV = 1
MODEL = "StandkePV"

# hyperparameters to use for the env and agent
env_params = {'reward_func': 'BalenceReward'}
model_params = { 
 'gae_lambda': 0.9 
}

In [None]:
# create an evaluation env on the test data; the callback saves the best model
eval_env = DummyVecEnv([lambda: StockTradingEnv(test, **env_params, random=False)])
eval_callback = EvalCallback(eval_env, best_model_save_path=f'/content/drive/MyDrive/RLmodels/bestPPO/{MODEL}',
                             log_path='/content/drive/MyDrive/RLmodels/logs/', eval_freq=int(MAX_STEPS/100),
                             deterministic=False, render=False)

# create training envs that take in the training data

envs =  DummyVecEnv([lambda: StockTradingEnv(train, **env_params, random=False) for _ in range(0,ENV)])

# check that the env has no errors, such as observation space errors
check_env(StockTradingEnv(train))
# wrap the training envs so that NaNs or infs raise an exception during training
envs = VecCheckNan(envs, raise_exception=True, check_inf=True)

# train the model using the separate Standke policy/value networks
# optional additional keyword parameters to pass to the model
policy_kwargs = dict()
model = PPO(StandkePolicy, envs, **model_params, verbose=1, tensorboard_log=f"/content/PPO_SPY_tensorboard/{MODEL}", policy_kwargs=policy_kwargs)


# General explanation of log output 

As detailed by araffin in his commit [Add explanation of logger output](https://github.com/DLR-RM/stable-baselines3/pull/803/files), for a given log block such as

```
-----------------------------------------
  | eval/                   |             |
  |    mean_ep_length       | 200         |
  |    mean_reward          | -157        |
  | rollout/                |             |
  |    ep_len_mean          | 200         |
  |    ep_rew_mean          | -227        |
  | time/                   |             |
  |    fps                  | 972         |
  |    iterations           | 19          |
  |    time_elapsed         | 80          |
  |    total_timesteps      | 77824       |
  | train/                  |             |
  |    approx_kl            | 0.037781604 |
  |    clip_fraction        | 0.243       |
  |    clip_range           | 0.2         |
  |    entropy_loss         | -1.06       |
  |    explained_variance   | 0.999       |
  |    learning_rate        | 0.001       |
  |    loss                 | 0.245       |
  |    n_updates            | 180         |
  |    policy_gradient_loss | -0.00398    |
  |    std                  | 0.205       |
  |    value_loss           | 0.226       |
  -----------------------------------------
```
``eval/``

- ``mean_ep_length``: Mean episode length
- ``mean_reward``: Mean episodic reward (during evaluation)

``rollout/``

- ``ep_len_mean``: Mean episode length (averaged over 100 episodes)
- ``ep_rew_mean``: Mean episodic training reward (averaged over 100 episodes)

``time/``

- ``episodes``: Total number of episodes
- ``fps``: Number of frames per second (includes time taken by gradient update)
- ``iterations``: Number of iterations (data collection + policy update for A2C/PPO)
- ``time_elapsed``: Time in seconds since the beginning of training
- ``total_timesteps``: Total number of timesteps (steps in the environments)

``train/``

- ``entropy_loss``: Mean value of the entropy loss (negative of the average policy entropy)
  * Want the entropy to decrease slowly and smoothly over the course of training, as the agent trades exploration for exploitation.
  * ⚠**According to the formula on line 91 of [model](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91), this term should not matter if ent_coef is 0, which is the default hyperparameter setting; it is difficult to interpret for this env because it is negative**⚠
  * **Furthermore, [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) cites [Andrychowicz, et al. (2021)](https://openreview.net/forum?id=nIAxjsniDzg), who overall find no evidence that the entropy term improves performance on continuous control environments (decision C13, figures 76 and 77)**
- ``clip_fraction``: Mean fraction of surrogate loss that was clipped (above clip_range threshold) for PPO
- ``clip_range``: Current value of the clipping factor for the surrogate loss of PPO
- ``learning_rate``: Current learning rate value
- ``n_updates``: Number of gradient updates applied so far
- ``policy_gradient_loss``: Current value of the policy gradient loss (its value does not have much meaning)(lol I did not say this 😸)
- ``std``: Current standard deviation of the noise when using generalized State-Dependent Exploration (gSDE) (which by default is not used)

# Important Training Metrics to Focus On!!!! ✅✅✅✅✅✅✅✅✅
- ``approx_kl``: approximate mean KL divergence between old and new policy (for PPO), it is an estimation of how much change happened in the update (i.e. information gain or loss)
  * **Want this value to SMOOTHLY DECREASE during training and be as close as possible to 0**
  * **Should be DECREASING**
- ``explained_variance``: Fraction of the return variance explained by the value function. This metric measures how good the value function is as a predictor of future rewards
  * **Want this value to be as close as possible to 1 (i.e.perfect predictions) during training rather than less than or equal to 0 (i.e. no predictive power)**
  * **Should be INCREASING**
- ``loss``: The total loss, i.e. the overall loss function
  * **Want to MINIMIZE this during training** 
  * **Should be DECREASING**
- ``value_loss``: error that value function is incurring 
  *   **Want to MINIMIZE this during training to 0 (though as discussed this isn't always possible due to randomness)**
  * **Should be DECREASING**
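
To make ``explained_variance`` less abstract, here is a NumPy sketch of the usual definition, 1 - Var[returns - predictions] / Var[returns]. The function below is written for illustration with made-up numbers and is not copied from Stable-Baselines3.

```python
import numpy as np


def explained_variance(y_pred, y_true):
    """1 - Var[y_true - y_pred] / Var[y_true]; NaN if y_true is constant."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    var_true = np.var(y_true)
    return float("nan") if var_true == 0 else float(1.0 - np.var(y_true - y_pred) / var_true)


returns = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(returns, returns))    # perfect value predictions
print(explained_variance([2.5] * 4, returns))  # constant predictions
```

Perfect predictions give 1, a value function that always predicts the same number gives 0, and predictions that are worse than a constant give a negative value.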





In [None]:
model.learn(total_timesteps=MAX_STEPS, callback=[eval_callback,TensorboardCallback()])

# Prediction and Printout of Agent's Trading Strategy on Test Data

In [None]:
model = PPO.load(f"/content/drive/MyDrive/RLmodels/bestPPO/{MODEL}/best_model.zip")
env = StockTradingEnv(test, **env_params, random=False)
obs = env.reset()
for i in range(len(test.date)):
  action, _states = model.predict(obs, deterministic=False)
  obs, rewards, done, info = env.step(action)
  env.render()
  if done:
    break

# TensorBoard Analysis

In [None]:
%load_ext tensorboard
%tensorboard --logdir /content/PPO_SPY_tensorboard/ 