<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/SecondStockEnivornment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Second Stock Trading Environment


> This second stock environment is based on Adam King's article [Create custom gym environments from scratch — A stock market example](https://towardsdatascience.com/creating-a-custom-openai-gym-environment-for-stock-trading-be532be3910e). As in the first stock trading environment, which follows Maxim Lapan's implementation in chapter eight of his book [Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998), the agent trades the [SPY ETF](https://www.etf.com/SPY?L=1). Here, however, the agent takes continuous rather than discrete actions and is tasked with managing a [trading account](https://www.investopedia.com/terms/t/tradingaccount.asp#:~:text=A%20trading%20account%20is%20an,margin%20requirements%20set%20by%20FINRA.). In the first trading environment the agent's reward is based on relative price movement; in this one it is based on the state of its trading account. As Adam King details, each action has two continuous components: the first selects whether to buy or sell the SPY (or, in this implementation, hold), and the second sets what percentage of the affordable shares to buy or of the held shares to sell.

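The two-component action described above can be sketched as a small decoding step. This is a minimal illustration that assumes the thresholds used later in `_take_action` (below 1 is a buy, below 2 is a sell, anything else is a hold); the helper name `decode_action` is hypothetical and is not part of the environment.

```python
def decode_action(action):
    """Map a two-component continuous action to (kind, fraction).

    Hypothetical helper: action[0] lies in [0, 3] and selects the action
    type, action[1] lies in [0, 1] and is the fraction to trade.
    """
    action_type, amount = float(action[0]), float(action[1])
    if action_type < 1:
        return "buy", amount
    if action_type < 2:
        return "sell", amount
    # Everything from 2 up to the upper bound of 3 is treated as a hold
    return "hold", 0.0


print(decode_action([0.4, 0.5]))   # a buy using half of the affordable shares
print(decode_action([1.5, 0.25]))  # a sell of a quarter of the held shares
print(decode_action([2.7, 0.9]))   # a hold; the fraction is ignored
```

Splitting the first component into three equal intervals gives buy, sell and hold the same share of the action range.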






In [49]:
import warnings
warnings.filterwarnings('ignore')

In [50]:
!pip install stable-baselines3[extra]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [51]:
import random
import json
import collections
import datetime

import gym
from gym import spaces
from gym.utils import seeding
import pandas as pd
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

In [52]:
# Stock Environment Parameters
MAX_ACCOUNT_BALANCE = 2147483647
MAX_NUM_SHARES = 2147483647
MAX_STEPS = 20000
TRADING_DAYS = 5
DEFAULT_COMMISSION_PERC = 0.01
INITIAL_ACCOUNT_BALANCE = 10000

In [63]:
# Stock/ETF Trading Environment
class StockTradingEnv(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, data, random_ofs_on_reset=True, commission_perc=DEFAULT_COMMISSION_PERC):
        super(StockTradingEnv, self).__init__()

        self.data = data
        self.random_ofs_on_reset = random_ofs_on_reset
        self.bars_count = TRADING_DAYS
        self.commission_perc = commission_perc
        self.buy_queue = []
        self.sell_queue = []

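        # action[0] in [0, 3] picks the action type (<1 buy, <2 sell, otherwise hold);
        # action[1] in [0, 1] is the fraction of the affordable or held shares to trade
        self.action_space = spaces.Box(
            low=np.array([0, 0]), high=np.array([3, 1]), dtype=np.float32)

        # Observations hold the relative high, low and close plus the raw volume
        # for the last five bars, followed by four account features. Relative
        # prices can be negative and volume is unscaled, so the box is unbounded.
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=self.shape, dtype=np.float32)
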
        self.seed()

    def reset(self):
        bars = self.bars_count
        if self.random_ofs_on_reset:
            # Pick a random start, leaving `bars` of history before it
            offset = self.np_random.choice(self.data.high.shape[0] - bars * 10) + bars
        else:
            offset = bars
        self._reset(offset)
        return self._next_observation()

    @property
    def shape(self):
        # Four values per bar plus four account features
        return (4 * self.bars_count + 4, )

    def _next_observation(self):
        # Collect the relative high, low and close plus the raw volume
        # for the last five bars
        res = np.ndarray(shape=self.shape, dtype=np.float32)
        shift = 0
        for bar_idx in range(-self.bars_count + 1, 1):
            res[shift] = self.data.high[self._offset + bar_idx]
            shift += 1
            res[shift] = self.data.low[self._offset + bar_idx]
            shift += 1
            res[shift] = self.data.close[self._offset + bar_idx]
            shift += 1
            res[shift] = self.data.volume[self._offset + bar_idx]
            shift += 1

        # Append the account features, each scaled by its maximum
        res[shift] = self.balance / MAX_ACCOUNT_BALANCE
        shift += 1
        res[shift] = self.shares_held / MAX_NUM_SHARES
        shift += 1
        res[shift] = self.total_shares_sold / MAX_NUM_SHARES
        shift += 1
        res[shift] = self.total_shares_bought / MAX_NUM_SHARES

        return res

    def _take_action(self, action):
        current_price = self._cur_close()
        action_type = action[0]
        amount = action[1]

        if action_type < 1:
            # Buy `amount` (a fraction) of the shares the balance could cover
            # at the current price. The commission is charged on top, so a
            # full-size buy can push the balance below zero, which ends the
            # episode in step().
            total_possible = abs(int(self.balance / current_price))
            shares_bought = int(total_possible * amount)
            cost = shares_bought * current_price
            self.balance -= cost + cost * self.commission_perc
            self.shares_held += shares_bought
            self.total_shares_bought += shares_bought

        elif action_type < 2:
            # Sell `amount` (a fraction) of the shares held, net of commission
            shares_sold = int(self.shares_held * amount)
            proceeds = shares_sold * current_price
            self.balance += proceeds - proceeds * self.commission_perc
            self.shares_held -= shares_sold
            self.total_shares_sold += shares_sold

        # Any other action_type (from 2 up to 3) is a hold

        self.net_worth = self.balance + self.shares_held * current_price
        if self.net_worth > self.max_net_worth:
            self.max_net_worth = self.net_worth

        # The reward is the cash balance after the trade, not the net worth
        reward = float(self.balance)

        self._offset += 1
        return reward

    def _cur_close(self):
        """
        Calculate real close price for the current bar
        """
        # data.close stores the close relative to the open, so undo that here
        open_price = self.data.open[self._offset]
        rel_close = self.data.close[self._offset]
        return open_price * (1.0 + rel_close)

    def step(self, action):
        # Execute one time step within the environment
        reward = self._take_action(action)

        # End the episode when the account is out of cash, the net worth
        # reaches the cap, or the data runs out
        done = bool(
            self.balance <= 0
            or self.net_worth >= MAX_ACCOUNT_BALANCE
            or self._offset >= self.data.close.shape[0] - 1
        )

        obs = self._next_observation()
        return obs, reward, done, {}

    def _reset(self, offset):
        # Reset the state of the environment to an initial state
        self.balance = INITIAL_ACCOUNT_BALANCE
        self.net_worth = INITIAL_ACCOUNT_BALANCE
        self.max_net_worth = INITIAL_ACCOUNT_BALANCE
        self.shares_held = 0
        self.total_shares_sold = 0
        self.total_sales_value = 0
        self.total_shares_bought = 0
        self._offset = offset



    def render(self, mode='human', close=False):
      # Render the environment to the screen
      print(self.balance)


    def seed(self, seed=None):
      self.np_random, seed1 = seeding.np_random(seed)
      seed2 = seeding.hash_seed(seed1+1) % 2**33
      return [seed1, seed2]
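
The account arithmetic in `_take_action` can be checked by hand. The sketch below repeats the buy and sell formulas outside the class, assuming the default 1% commission; the function names `buy` and `sell` and the example price of 99 are hypothetical, chosen only to show that a full-size buy can overdraw the balance because the commission is added on top of the share cost.

```python
def buy(balance, price, amount, commission=0.01):
    """Return (new_balance, shares_bought) using the environment's buy formula."""
    total_possible = abs(int(balance / price))   # shares affordable before commission
    shares = int(total_possible * amount)
    cost = shares * price
    return balance - (cost + cost * commission), shares


def sell(balance, shares_held, price, amount, commission=0.01):
    """Return (new_balance, shares_sold) using the environment's sell formula."""
    shares = int(shares_held * amount)
    proceeds = shares * price
    return balance + (proceeds - proceeds * commission), shares


# Full-size buy: 101 shares cost 9999 plus about 99.99 commission,
# so the balance ends up slightly below zero.
full_balance, full_shares = buy(10000, 99.0, 1.0)
print(full_balance, full_shares)

# Half-size buy: 50 shares, and the balance stays positive.
half_balance, half_shares = buy(10000, 99.0, 0.5)
print(half_balance, half_shares)
```

Because `step` ends the episode once the balance is zero or negative, an agent that always buys at full size would terminate almost immediately under these formulas.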


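To make the observation vector easier to read, the following sketch lists what each of its entries holds, following the loop order in `_next_observation`. The helper `observation_layout` is hypothetical and only produces labels; it assumes the default of five bars.

```python
def observation_layout(bars_count=5):
    """Return one label per observation entry, oldest bar first."""
    labels = []
    for bar_idx in range(-bars_count + 1, 1):
        lag = -bar_idx  # how many bars before the current offset
        for field in ("high", "low", "close", "volume"):
            labels.append(f"{field}[t-{lag}]")
    # Four account features close the vector
    labels += [
        "balance / MAX_ACCOUNT_BALANCE",
        "shares_held / MAX_NUM_SHARES",
        "total_shares_sold / MAX_NUM_SHARES",
        "total_shares_bought / MAX_NUM_SHARES",
    ]
    return labels


layout = observation_layout()
print(len(layout))  # 4 * 5 + 4 entries
for index, label in enumerate(layout):
    print(index, label)
```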
In [65]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/StockMarketData/archive/Data/ETFs/spy.us.txt')
df = df.sort_values('Date')
data=df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]

# dates in year-month-day form
dates = data['Date'].array
# prices relative to the open: (price - open) / open
rh = (data['High'].values-data['Open'].values)/data['Open'].values
rl = (data['Low'].values-data['Open'].values)/data['Open'].values
rc = (data['Close'].values-data['Open'].values)/data['Open'].values
o = data['Open'].values
# volume data
vol = data['Volume'].values

Data = collections.namedtuple('Data', field_names=['date','high', 'low', 'close', 'open', 'volume'])
data=Data(date=dates, high=rh, low=rl, close=rc, open=o, volume=vol)
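
The relative-price encoding and the way `_cur_close` undoes it can be verified on a single bar. This is a small sketch with made-up prices; the function names `to_relative` and `real_close` are hypothetical stand-ins for the array expressions above and for the method in the environment.

```python
def to_relative(open_price, high, low, close):
    """Express high, low and close as fractions of the open price."""
    return (
        (high - open_price) / open_price,
        (low - open_price) / open_price,
        (close - open_price) / open_price,
    )


def real_close(open_price, rel_close):
    """Recover the absolute close from the open and the relative close."""
    return open_price * (1.0 + rel_close)


# One illustrative bar: open 100, high 102, low 99, close 101
rh, rl, rc = to_relative(100.0, 102.0, 99.0, 101.0)
print(rh, rl, rc)             # the relative low is negative
print(real_close(100.0, rc))  # close to the original 101
```

Note that the relative low is at or below zero by construction, so the price features are not confined to the range 0 to 1.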

In [66]:
# Stable-Baselines3 algorithms run on vectorized environments; PPO wraps a plain gym.Env in a DummyVecEnv automatically
env = StockTradingEnv(data)
model = PPO("MlpPolicy", env, verbose=1)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [67]:
model.learn(total_timesteps=MAX_STEPS)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 36.3     |
|    ep_rew_mean     | 8.57e+04 |
| time/              |          |
|    fps             | 806      |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 43.1         |
|    ep_rew_mean          | 9.2e+04      |
| time/                   |              |
|    fps                  | 604          |
|    iterations           | 2            |
|    time_elapsed         | 6            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0036704834 |
|    clip_fraction        | 0.00532      |
|    clip_range           | 0.2          |
|    entropy_loss         | -2.84        |
|    explained_variance   | 1.79e-07     |
|    learning_r

<stable_baselines3.ppo.ppo.PPO at 0x7f3739f54e50>

In [69]:
obs = env.reset()
for i in range(30):
  action, _states = model.predict(obs)
  obs, reward, done, info = env.step(action)
  env.render()
  # start a new episode if this one has ended
  if done:
    obs = env.reset()

6998.1487
6998.1487
6998.1487
6998.1487
5126.8611
5126.8611
4410.9226
1967.1164999999996
1967.1164999999996
1967.1164999999996
1967.1164999999996
1967.1164999999996
1967.1164999999996
2828.5352999999996
2828.5352999999996
2828.5352999999996
4791.9429
4791.9429
4791.9429
3063.3885
3063.3885
3063.3885
3063.3885
4923.4896
4923.4896
4923.4896
4923.4896
61.30919999999969
61.30919999999969
61.30919999999969
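
The values printed by `render` are the cash balance only, so a drop usually means that cash was converted into shares rather than lost. The sketch below applies the environment's net worth formula (`balance + shares_held * current_price`) to hypothetical numbers; the share count and price are assumptions for illustration and were not recorded in the run above.

```python
def net_worth(balance, shares_held, current_price):
    """Cash plus the market value of the shares held."""
    return balance + shares_held * current_price


# Hypothetical account state: little cash, most of the value held in shares
cash = 61.31
shares = 40      # assumed, not taken from the run
price = 250.0    # assumed, not taken from the run
print(net_worth(cash, shares, price))
```

Printing `env.net_worth` alongside `env.balance` in the loop would show both quantities for the actual run.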
