<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/SpyTradingAgent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Trading Using Reinforcement Learning


This implementation is based on chapter 10 of the book 
[Deep Reinforcement Learning Hands-On - Second Edition by Maxim Lapan](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998/ref=asc_df_1838826998/?tag=hyprod-20&linkCode=df0&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229&psc=1&tag=&ref=&adgrpid=93867144477&hvpone=&hvptwo=&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229)

As stated in chapter 10: 

> Rather than learning new methods to solve toy reinforcement learning (RL) problems in this chapter, we will try to utilize our deep Q-network (DQN) knowledge to deal with the much more practical problem of financial trading. 

Namely, an RL agent observes the market and must choose one of three actions: buy, sell (close the position), or hold. If the agent buys before the price goes up, its profit is positive; otherwise, it receives a negative reward. The agent tries to obtain as much profit as possible in the trading environment. 
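
To make the profit idea concrete, here is a minimal sketch of the percentage profit of a single buy-then-close round trip. The helper name `trade_profit_pct` and the example prices are hypothetical; the formula mirrors the one used in `validation_run` later in this notebook, where commission is charged on both the buy and the sell price.

```python
def trade_profit_pct(buy_price, sell_price, commission_perc=0.1):
    """Percentage profit of one round trip, with commission on both legs.

    Hypothetical helper for illustration; commission_perc is in percent.
    """
    commission = (buy_price + sell_price) * commission_perc / 100.0
    return 100.0 * (sell_price - buy_price - commission) / buy_price


# Invented prices: one trade closed above the entry, one closed below it.
profit_up = trade_profit_pct(100.0, 105.0)
profit_down = trade_profit_pct(100.0, 97.0)
print(f"price rose: {profit_up:.3f}%  price fell: {profit_down:.3f}%")
```

Note that even a winning trade gives back part of the move to commission, so very small price moves can still end up as a loss.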



In [None]:
!pip install ptan

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install tensorboardX

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import torch 
import torch.nn as nn
import torch.nn.functional as F
import gym
import gym.spaces
from gym.utils import seeding
from gym.envs.registration import EnvSpec
import enum
import glob
import os
import collections
import csv
import sys
import time
import ptan
import torch.optim as optim
from tensorboardX import SummaryWriter

# (Optional): Validation 

In [None]:
# validation function stuff
def validation_run(env, net, episodes=100, device="cpu", epsilon=0.02, comission=0.1):
    stats = {
        'episode_reward': [],
        'episode_steps': [],
        'order_profits': [],
        'order_steps': [],
    }

    for episode in range(episodes):
        obs = env.reset()

        total_reward = 0.0
        position = None
        position_steps = None
        episode_steps = 0

        while True:
            obs_v = torch.tensor([obs]).to(device)
            out_v = net(obs_v)

            action_idx = out_v.max(dim=1)[1].item()
            if np.random.random() < epsilon:
                action_idx = env.action_space.sample()
            action = Actions(action_idx)

            close_price = env._state._cur_close()

            if action == Actions.Buy and position is None:
                position = close_price
                position_steps = 0
            elif action == Actions.Close and position is not None:
                profit = close_price - position - (close_price + position) * comission / 100
                profit = 100.0 * profit / position
                stats['order_profits'].append(profit)
                stats['order_steps'].append(position_steps)
                position = None
                position_steps = None

            obs, reward, done, _ = env.step(action_idx)
            total_reward += reward
            episode_steps += 1
            if position_steps is not None:
                position_steps += 1
            if done:
                if position is not None:
                    profit = close_price - position - (close_price + position) * comission / 100
                    profit = 100.0 * profit / position
                    stats['order_profits'].append(profit)
                    stats['order_steps'].append(position_steps)
                break

        stats['episode_reward'].append(total_reward)
        stats['episode_steps'].append(episode_steps)

    return { key: np.mean(vals) for key, vals in stats.items() }

# Price Data for Trading Environment

The chapter uses Russian stock market prices from 2015-2016 for the technology company [Yandex](https://en.wikipedia.org/wiki/Yandex) to train its reinforcement learning trading agent. The dataset contains over 130,000 rows, where every row represents a single minute, and the price movement during that minute is captured by five variables: open, high, low, close, and volume. 

Rather than use one stock, I decided to use the basket of stocks held by the [SPY ETF](https://www.etf.com/SPY#:~:text=SPY%20is%20the%20best%2Drecognized,US%20index%2C%20the%20S%26P%20500.). This gives a longer trading horizon than the one provided by the Yandex data. The period ranges from 2005 to 2022. Each row represents a single trading day, and the price movement during that day is captured by five variables: open, high, low, close, and volume. 
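
As a rough illustration of this row layout, the sketch below parses a tiny in-memory CSV into one record per trading day. The two rows are invented for illustration and are not real SPY quotes, the `Date` column is an assumption, and the names `DailyBar` and `parse_bars` are hypothetical; only the `Open`, `High`, `Low`, `Close`, and `Volume` column names come from the loader in this notebook.

```python
import csv
import io
from collections import namedtuple

DailyBar = namedtuple("DailyBar", ["open", "high", "low", "close", "volume"])

# Invented rows for illustration only; these are not real SPY quotes.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2020-01-02,100.0,102.0,99.0,101.0,1000000
2020-01-03,101.0,101.5,98.0,98.5,1200000
"""

COLUMNS = ("Open", "High", "Low", "Close", "Volume")


def parse_bars(text):
    """Turn CSV text into a list of DailyBar records, one per trading day."""
    reader = csv.DictReader(io.StringIO(text))
    return [DailyBar(*(float(row[col]) for col in COLUMNS)) for row in reader]


bars = parse_bars(SAMPLE_CSV)
for bar in bars:
    # Intraday move relative to the open, the same idea as the relative
    # prices computed by prices_to_relative in this notebook.
    move = (bar.close - bar.open) / bar.open
    print(f"open={bar.open:.2f} close={bar.close:.2f} move={move:+.4f}")
```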

In [None]:
Prices = collections.namedtuple('Prices', field_names=['open', 'high', 'low', 'close', 'volume'])


def read_csv(file_name, sep=',', filter_data=True, fix_open_prices=False):
  print("Reading", file_name)
  with open(file_name, 'r') as fd:
    reader = csv.reader(fd, delimiter=sep)
    h = next(reader)
    indices = [h.index(s) for s in ('Open', 'High', 'Low', 'Close', 'Volume')]
    o, h, l, c, v = [], [], [], [], []
    count_out = 0
    count_filter = 0 
    count_fixed = 0
    prev_vals = None
    for row in reader:
      vals = list(map(float, [row[idx] for idx in indices])) 
      if filter_data and all(map(lambda v: abs(v-vals[0]) < 1e-8, vals[:-1])):
        count_filter += 1
        continue
      
      po, ph, pl, pc, pv = vals
      
      # putting price data into list and then into a np.array 
      # where o is open price, c is close price, h is high price, l 
      # is low price, and v is volume
      count_out +=1
      o.append(po)
      c.append(pc)
      h.append(ph)
      l.append(pl)
      v.append(pv)
      prev_vals = vals
  return Prices(open=np.array(o, dtype=np.float32),high=np.array(h, dtype=np.float32), low=np.array(l, dtype=np.float32),close=np.array(c, dtype=np.float32), volume=np.array(v, dtype=np.float32))


# prices(object): of collections.namedtuple type
def prices_to_relative(prices):
    """
    Convert prices to relative in respect to open price
    :param prices: Prices tuple with open, high, low, close, volume
    :return: Prices tuple with open, rel_high, rel_low, rel_close, volume
    """
    assert isinstance(prices, Prices)
    rh = (prices.high - prices.open) / prices.open
    rl = (prices.low - prices.open) / prices.open
    rc = (prices.close - prices.open) / prices.open
    return Prices(open=prices.open, high=rh, low=rl, close=rc, volume=prices.volume)

def load_relative(csv_file):
    return prices_to_relative(read_csv(csv_file))


# Creating Trading Environment using gym.Env Class 





In [None]:
# default number of past trading days agent can observe when taking action;
# for 1D convolution model, this is the column portion of 2D matrix
DEFAULT_BARS_COUNT = 2
# default percentage of stock price trading agent pays broker when 
# buying/selling, default is 0.1% (i.e. very reasonable)
DEFAULT_COMMISSION_PERC = 0.1


# Actions
class Actions(enum.Enum):
  # actions agent can take when trading
  Skip = 0 
  Buy = 1
  Close = 2

# StocksEnv
class StocksEnv(gym.Env):

  # fields required by gym.Env
  metadata = {'render.modes': ['human']}
  spec = EnvSpec("SPYEnv-v0")

  def __init__(self, prices, bars_count=DEFAULT_BARS_COUNT, commission=DEFAULT_COMMISSION_PERC,
               reset_on_close=True, state_1d=False, random_ofs_on_reset=True,
               reward_on_close=False, volumes=True):
    assert isinstance(prices, dict)
    self._prices = prices

#---------------------State-Observation Encoding Section------------------------    
    # key!!!: creating the state observation for trading agent; when using
    # the 1D convolutional model, encoding must be of State1D class!!!!
    if state_1d:
      self._state = State1D(bars_count, commission, reset_on_close, reward_on_close=reward_on_close, volumes=volumes)
    else:
      self._state = State(bars_count, commission, reset_on_close, reward_on_close=reward_on_close, volumes=volumes)
    
    # creating action space for trading agent
    self.action_space = gym.spaces.Discrete(n=len(Actions))
    
    # creating observation space for training agent
    self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=self._state.shape, dtype=np.float32)
    
    # decide if want to use random offset, default is True
    self.random_ofs_on_reset = random_ofs_on_reset
    self.seed()

#------------------------Reset Section------------------------------------------
  # picks the starting offset into the time series data (i.e. an episode does
  # not always start at the beginning of the time series data)
  def reset(self):
    self._instrument = self.np_random.choice(list(self._prices.keys()))
    prices = self._prices[self._instrument]
    bars = self._state.bars_count
    if self.random_ofs_on_reset:
      offset = self.np_random.choice(prices.high.shape[0]-bars*10)+bars
    else:
      offset = bars
    self._state.reset(prices, offset)
    return self._state.encode()  

#-----------------------Step Section--------------------------------------------
  # executes the sequence of agent taking action, getting reward and
  # then getting the next observation/state 
  def step(self, action_idx):
    action = Actions(action_idx)
    reward, done = self._state.step(action)
    obs = self._state.encode()
    info = {"instrument":self._instrument, "offset": self._state._offset}
    return obs, reward, done, info

#----------------------Render Section-------------------------------------------
  # required by gym.Env; rendering is not implemented yet. A future version may
  # visualize the agent's observations while trading for comparison
  def render(self, mode='human', close=False):
    pass
  def close(self):
    pass

  def seed(self, seed=None):
    self.np_random, seed1 = seeding.np_random(seed)
    seed2 = seeding.hash_seed(seed1+1) % 2**33
    return [seed1, seed2]

#----------------------Class Method Section-------------------------------------
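  # creates an instance of the StocksEnv object from a directory of CSV files
  # NOTE: price_files is not defined in this notebook; it is assumed to be a
  # helper that returns the paths of the CSV files found in data_dir
  @classmethod
  def from_dir(cls, data_dir, **kwargs):
    prices = {f: load_relative(f) for f in price_files(data_dir)}
    return cls(prices, **kwargs)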

# State Space

As detailed by Maxim Lapan in the [book](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998/ref=asc_df_1838826998/?tag=hyprod-20&linkCode=df0&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229&psc=1&tag=&ref=&adgrpid=93867144477&hvpone=&hvptwo=&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229), the reward is of ***either/or form***. Namely

>  We can split the reward into multiple steps during our ownership of the share. In that case, the reward on every step will be equal to the last bar's movement. On the other hand, the agent will receive the reward only after the close action and receive the full reward at once. At first sight, both variants should have the same final result, but maybe with different convergence speeds. However, in practice, the difference could be dramatic. We will implement both variants to compare them.

This ***either/or form*** of the reward is selected by setting the variable `reward_on_close` to either True or False. As the book details:

> The reward_on_close is a Boolean parameter that switches between the two reward schemes discussed previously. If it is set to True, the agent will receive a reward only on the "close" action issue. Otherwise, we will give a small reward every bar, corresponding to price movement during that bar.

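In the `State` class below, `reward_on_close` defaults to True, which amounts to the trading strategy of [selling high](https://www.investopedia.com/articles/investing/081415/look-buy-low-sell-high-strategy.asp). Changing `reward_on_close` to False amounts to the trading strategy of [buying low](https://www.investopedia.com/articles/investing/081415/look-buy-low-sell-high-strategy.asp). Note that `StocksEnv` passes its own default of `reward_on_close=False`, so the per-bar scheme is used unless the flag is set explicitly when the environment is created.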
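
The two schemes quoted above can be compared on a toy series. The sketch below is illustrative only: the prices are invented, commission is ignored, and the function names `reward_at_close` and `rewards_per_bar` are hypothetical. It follows the book's description of the schemes rather than the exact arithmetic in the `State.step` method of this notebook.

```python
def reward_at_close(closes):
    """One reward when the position is closed: percent change entry to exit."""
    entry, exit_price = closes[0], closes[-1]
    return 100.0 * (exit_price - entry) / entry


def rewards_per_bar(closes):
    """One reward per held bar: percent change versus the previous close."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(closes, closes[1:])]


# Invented close prices while the position is held (bought at the first one).
closes = [100.0, 102.0, 101.0, 104.0]

total_at_close = reward_at_close(closes)
per_bar = rewards_per_bar(closes)

print("reward at close:", total_at_close)
print("per-bar rewards:", [round(r, 3) for r in per_bar])
print("sum of per-bar rewards:", round(sum(per_bar), 3))
```

The per-bar rewards add up to roughly, but not exactly, the single reward at close, because each per-bar percentage is measured against a different base price. The agent therefore sees a similar total signal in both schemes, just distributed differently over time.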


In [None]:
# General State Class (i.e. models not based on convolutions)
class State:
  def __init__(self, bars_count, commission_perc, reset_on_close, reward_on_close=True, volumes=True):
    # checking bars_count is an int
    assert isinstance(bars_count, int)
    # checking that bars_count is greater than zero
    assert bars_count > 0
    # checking commission is a float
    assert isinstance(commission_perc, float)
    # checking commission is non-negative
    assert commission_perc >= 0.0
    # checking that reset_on_close and reward on close are bools
    assert isinstance(reset_on_close, bool)
    assert isinstance(reward_on_close, bool)
    self.bars_count=bars_count
    self.commission_perc = commission_perc
    self.reset_on_close = reset_on_close
    self.reward_on_close = reward_on_close
    self.volumes = volumes
    self.previous_close = 0.0
  
  # method that resets the environment
  def reset(self, prices, offset):
    assert isinstance(prices, Prices)
    assert offset >= self.bars_count-1
    self.have_position = False  # start with no stocks
    self.open_price = 0.0
    self._prices = prices
    self._offset = offset

  # the shape of the state
  @property
  def shape(self):
    # the shape is the number of price features per bar (high, low, close,
    # i.e. 3, or 4 if volume is used) times the number of bars
    # (i.e. past prices agent can observe) plus the position flag
    # (i.e. whether agent is holding onto the stock or not) and
    # the relative profit agent has received since opening
    if self.volumes:
      return (4*self.bars_count+1+1, )
    else:
      return (3*self.bars_count+1+1, )
  
  # method that encodes the current state
  def encode(self):
    res = np.ndarray(shape=self.shape, dtype=np.float32)
    shift = 0
    for bar_idx in range(-self.bars_count+1, 1):
      res[shift] = self._prices.high[self._offset + bar_idx]
      shift += 1
      res[shift] = self._prices.low[self._offset + bar_idx]
      shift += 1
      res[shift] = self._prices.close[self._offset + bar_idx]
      shift += 1
      if self.volumes:
        res[shift] = self._prices.volume[self._offset + bar_idx]
        shift += 1
    res[shift] = float(self.have_position)
    shift += 1
    if not self.have_position:
      res[shift] = 0.0
    else:
      res[shift] = (self._cur_close() - self.open_price) / self.open_price
    return res
 
  def _cur_close(self):
    """
    Calculate real close price for the current bar
    """
    open = self._prices.open[self._offset]
    rel_close = self._prices.close[self._offset]
    return open * (1.0 + rel_close)

  def _cur_open(self):
    """
    Calculate real open price for the current bar
    """
    open = self._prices.open[self._offset]
    return open

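  def _previous_close(self):
    """
    Calculate real close price for the previous bar
    """
    # index the bar before the current offset; a negated offset would count
    # backwards from the end of the series instead
    open = self._prices.open[self._offset - 1]
    rel_close = self._prices.close[self._offset - 1]
    return open * (1.0 + rel_close)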
 


#---------------!!!Step Section & Reward Calculation!!!------------------------- 
  def step(self, action):
      """
      Perform one step in our price, adjust offset, check for the end of prices
      and handle position change
      :param action:
      :return: reward, done
      """
      assert isinstance(action, Actions)
      reward = 0.0
      done = False
      close = self._cur_close()
      open = self._cur_open()

      if action == Actions.Buy and not self.have_position:
        self.have_position = True
        self.open_price = close
        reward -= self.commission_perc
      elif action == Actions.Close and self.have_position:
        reward -= self.commission_perc
        done |= self.reset_on_close
        """
        implements sell high, buy low  trading strategy, since
        positive reward is given when selling higher than previous open
        and a negative reward is given when selling lower than previous open
        """
        if self.reward_on_close:
          reward += 100.0 * (close - self.open_price) / self.open_price
        self.have_position = False
        self.open_price = 0.0

      self._offset += 1
      prev_close = self._previous_close()
      close = self._cur_close()
      done |= self._offset >= self._prices.close.shape[0]-1

      """
      implements buy low, sell high trading strategy, since
      positive reward is given when buying lower than previous open
      and a negative reward is given when buying higher than previous open
      """
      if self.have_position and not self.reward_on_close:
        reward += 100.0 * (prev_close-open) / prev_close

      return reward, done

# Specific State Class for encoding observation space for convolution models
class State1D(State):
    """
    State with shape suitable for 1D convolution, must be 2D of form 
    (row, column) where row is either 5 for just price data or 6 if agent can
    also observe volume information and column is number of past trading days
    agent can observe
    """
    @property
    def shape(self):
        if self.volumes:
            return (6, self.bars_count)
        else:
            return (5, self.bars_count)

    def encode(self):
        res = np.zeros(shape=self.shape, dtype=np.float32)
        ofs = self.bars_count-1
        res[0] = self._prices.high[self._offset-ofs:self._offset+1]
        res[1] = self._prices.low[self._offset-ofs:self._offset+1]
        res[2] = self._prices.close[self._offset-ofs:self._offset+1]
        if self.volumes:
            res[3] = self._prices.volume[self._offset-ofs:self._offset+1]
            dst = 4
        else:
            dst = 3
        if self.have_position:
            res[dst] = 1.0
            res[dst+1] = (self._cur_close() - self.open_price) / self.open_price
        return res

# Creating the Dueling DQN Model using 1D Convolutions for first transformation (i.e. cross-correlation) and then linear transformations for feed-forward value and advantage calculation 
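
As a small worked example of the dueling aggregation, the sketch below combines a state value and per-action advantages into Q-values using plain Python. The function name `dueling_q` and the numbers are hypothetical; the formula, value plus advantage minus the mean advantage, is the one returned by `DQNConv1D.forward` in this notebook.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value with per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + adv - mean_adv for adv in advantages]


# Invented numbers; the three entries stand for the actions Skip, Buy, Close.
value = 1.0
advantages = [0.5, -0.5, 0.0]

q_values = dueling_q(value, advantages)
print(q_values)  # → [1.5, 0.5, 1.0]
```

Subtracting the mean advantage keeps the mean of the Q-values equal to the state value, so the value and advantage heads cannot drift apart by an arbitrary constant.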


In [None]:
class DQNConv1D(nn.Module):
    def __init__(self, shape, actions_n):
        super(DQNConv1D, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 128, 1),
            nn.ReLU(),
            nn.Conv1d(128, 128, 1),
            nn.ReLU(),
        )

        out_size = self._get_conv_out(shape)

        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        return val + adv - adv.mean(dim=1, keepdim=True)



# Training the Trading Agent with the RewardTracker Class (i.e. tracks agent status in environment)
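
During training, exploration is controlled by a linearly decaying epsilon. The sketch below isolates that schedule so it can be inspected on its own. The function name `epsilon_at` is hypothetical; the constants and the expression are copied from the parameters and training loop later in this section.

```python
EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1000000


def epsilon_at(step_idx):
    """Linear decay from EPSILON_START, clipped from below at EPSILON_STOP."""
    return max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)


eps_first = epsilon_at(0)
eps_middle = epsilon_at(500000)
eps_late = epsilon_at(2000000)
print(eps_first, eps_middle, eps_late)  # → 1.0 0.5 0.1
```

With these settings the agent acts fully at random at the start and reaches the floor of 0.1 after roughly 900,000 steps, after which one action in ten is still random.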

In [None]:
class RewardTracker:

    # stop_reward(int): reward stopping threshold for agent, default is 1 
    # group_rewards(int): trading period, default is 1 trading day
    def __init__(self, writer, stop_reward=1, group_rewards=1):
        self.writer = writer
        self.stop_reward = stop_reward
        self.reward_buf = []
        self.steps_buf = []
        self.group_rewards = group_rewards

    def __enter__(self):
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        self.total_steps = []
        return self

    def __exit__(self, *args):
        self.writer.close()

#-----------------------------Reward Section------------------------------------
    def reward(self, reward_steps, frame, epsilon=None):
        
        # Reward Per Steps
        reward, steps = reward_steps
       
        self.reward_buf.append(reward)
        self.steps_buf.append(steps)
        if len(self.reward_buf) < self.group_rewards:
            return False
        # calculates mean reward from buffer 
        reward = np.mean(self.reward_buf)
        steps = np.mean(self.steps_buf)
        self.reward_buf.clear()
        self.steps_buf.clear()

        # Total Rewards
        self.total_rewards.append(reward)

        # Total Steps
        self.total_steps.append(steps)

        speed = (frame - self.ts_frame) / (time.time() - self.ts)
        self.ts_frame = frame
        self.ts = time.time()

        # Mean Reward 
        mean_reward = np.mean(self.total_rewards[-self.group_rewards:]) 

        mean_steps = np.mean(self.total_steps[-self.group_rewards:])
        epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
        print("Trading Experience XP: %d || Mean Reward Per %d Trading Days: %.3f"  % (len(self.total_rewards)*self.group_rewards, self.group_rewards, mean_reward))
        sys.stdout.flush()
        if epsilon is not None:
            self.writer.add_scalar("epsilon", epsilon, frame)
        self.writer.add_scalar("speed", speed, frame)
        self.writer.add_scalar("reward_per_tradingWindow", mean_reward, frame)
        self.writer.add_scalar("steps_per_tradingWindow", mean_steps, frame)
        if mean_reward > self.stop_reward:
            print("Kid, you're on a roll. Enjoy it while it lasts, 'cause it never does.")
            return True
        return False

In [None]:
#----------------------Parameters Section---------------------------------------
BATCH_SIZE = 32
TARGET_NET_SYNC = 1000
GAMMA = 0.99
REWARD_STEPS = 2
REPLAY_SIZE = 100000
REPLAY_INITIAL = 10000
LEARNING_RATE = 0.0001
EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1000000
CHECKPOINT_EVERY_STEP = 1000000
VALIDATION_EVERY_STEP = 100000
CUDA = True
YEAR = None

STATES_TO_EVALUATE = 1000
EVAL_EVERY_STEP = 1000
MAX_EPISODES = 1000
WINDOW = 5
STOP_REWARD = 2e6
BARS_COUNT = 2

#----------------------Storage-Path Section-------------------------------------
DEFAULT_STOCKS = "/content/drive/MyDrive/Datasets/SPY/spy_past.csv"
DEFAULT_VAL_STOCKS = "/content/drive/MyDrive/Datasets/SPY/spy_future.csv"
SAVE_PATH = "saves"

#----------------------Loss Calculation Section---------------------------------
def calc_values_of_states(states, net, device="cpu"):
    mean_vals = []
    for batch in np.array_split(states, 64):
        states_v = torch.tensor(batch).to(device)
        action_values_v = net(states_v)
        best_action_values_v = action_values_v.max(1)[0]
        mean_vals.append(best_action_values_v.mean().item())
    return np.mean(mean_vals)


def unpack_batch(batch):
    states, actions, rewards, dones, last_states = [], [], [], [], []
    for exp in batch:
        state = np.array(exp.state, copy=False)
        states.append(state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        dones.append(exp.last_state is None)
        if exp.last_state is None:
            last_states.append(state)       # the result will be masked anyway
        else:
            last_states.append(np.array(exp.last_state, copy=False))
    return np.array(states, copy=False), np.array(actions), np.array(rewards, dtype=np.float32), \
           np.array(dones, dtype=np.uint8), np.array(last_states, copy=False)


def calc_loss(batch, net, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = unpack_batch(batch)

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    # boolean mask; indexing with a uint8 (Byte) mask is deprecated in PyTorch
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_actions = net(next_states_v).max(1)[1]
    next_state_values = tgt_net(next_states_v).gather(1, next_state_actions.unsqueeze(-1)).squeeze(-1)
    next_state_values[done_mask] = 0.0

    expected_state_action_values = next_state_values.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)

#-----------------------------Main Section--------------------------------------
if __name__ == "__main__":
    # fall back to the CPU when no GPU is available in the runtime
    device = torch.device("cuda" if CUDA and torch.cuda.is_available() else "cpu")
    saves_path = os.path.join("/content/", SAVE_PATH)
    os.makedirs(saves_path, exist_ok=True)

    # load training and test data and create trading environment
    if YEAR is not None or os.path.isfile(DEFAULT_STOCKS):
        if YEAR is not None:
            stock_data = data.load_year_data(YEAR)
        else:
            stock_data = {"SPY": load_relative(DEFAULT_STOCKS)}
        env = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True, volumes=True)
        env_tst = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True)
    elif os.path.isdir(DEFAULT_STOCKS):
        env = StocksEnv.from_dir(DEFAULT_STOCKS, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
        env_tst = StocksEnv.from_dir(DEFAULT_STOCKS, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
    else:
        raise RuntimeError("No data to train on")
    
    # episode time limit for agent, default is 1000 steps 
    env = gym.wrappers.TimeLimit(env, max_episode_steps=MAX_EPISODES)
    
    # loading validation data and creating trading environment  
    val_data = {"SPY": load_relative(DEFAULT_VAL_STOCKS)}
    env_val = StocksEnv(val_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True)

    # tensorboard stuff
    writer = SummaryWriter(comment="-simple-" + "run")

    # creating the 1D convolution model
    net = DQNConv1D(env.observation_space.shape, env.action_space.n).to(device)
    # Adam optimizer
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)  

#----------------------------Ptan Section---------------------------------------    
    # https://github.com/Shmuma/ptan/blob/master/ptan/agent.py
    tgt_net = ptan.agent.TargetNet(net)
    # https://github.com/Shmuma/ptan/blob/master/ptan/actions.py
    selector = ptan.actions.EpsilonGreedyActionSelector(EPSILON_START)
    # https://github.com/Shmuma/ptan/blob/master/ptan/agent.py
    agent = ptan.agent.DQNAgent(net, selector, device=device)
    # https://github.com/Shmuma/ptan/blob/master/ptan/experience.py
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count=REWARD_STEPS)
    # https://github.com/Shmuma/ptan/blob/master/ptan/experience.py
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)

#-----------------------RewardTracker/Training Loop Section---------------------
    # initialization
    step_idx = 0
    eval_states = None
    best_mean_val = None
    
    # RewardTracker
    with RewardTracker(writer, stop_reward=STOP_REWARD, group_rewards=WINDOW) as reward_tracker:
        while True:
            step_idx += 1
            buffer.populate(1)
            selector.epsilon = max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)

            new_rewards = exp_source.pop_rewards_steps()
            if new_rewards:
                reward_tracker.reward(new_rewards[0], step_idx, selector.epsilon)

            if len(buffer) < REPLAY_INITIAL:
                continue

            if eval_states is None:
                print("Initial buffer populated, start training")
                eval_states = buffer.sample(STATES_TO_EVALUATE)
                eval_states = [np.array(transition.state, copy=False) for transition in eval_states]
                eval_states = np.array(eval_states, copy=False)

            if step_idx % EVAL_EVERY_STEP == 0:
                mean_val = calc_values_of_states(eval_states, net, device=device)
                writer.add_scalar("values_mean", mean_val, step_idx)
                if best_mean_val is None or best_mean_val < mean_val:
                    if best_mean_val is not None:
                        print("%d: Best mean value updated %.3f -> %.3f" % (step_idx, best_mean_val, mean_val))
                    best_mean_val = mean_val
                    torch.save(net.state_dict(), os.path.join(saves_path, "mean_val-%.3f.data" % mean_val))
                    

            optimizer.zero_grad()
            batch = buffer.sample(BATCH_SIZE)

            # Loss calculation, backprop and gradient descent stuff
            loss_v = calc_loss(batch, net, tgt_net.target_model, GAMMA ** REWARD_STEPS, device=device)
            loss_v.backward()
            optimizer.step()

            if step_idx % TARGET_NET_SYNC == 0:
                tgt_net.sync()

            if step_idx % CHECKPOINT_EVERY_STEP == 0:
                idx = step_idx // CHECKPOINT_EVERY_STEP
                # zero-pad the index so the file name contains no spaces
                torch.save(net.state_dict(), os.path.join(saves_path, "checkpoint-%03d.data" % idx))

            if step_idx % VALIDATION_EVERY_STEP == 0:
                res = validation_run(env_tst, net, device=device)
                for key, val in res.items():
                    writer.add_scalar(key + "_test", val, step_idx)
                res = validation_run(env_val, net, device=device)
                for key, val in res.items():
                    writer.add_scalar(key + "_val", val, step_idx)

Reading /content/drive/MyDrive/Datasets/SPY/spy_past.csv
Reading /content/drive/MyDrive/Datasets/SPY/spy_future.csv


UnboundLocalError: ignored

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs/

In [None]:
!zip -r /content/runs.zip /content/runs/