<a href="https://colab.research.google.com/github/aCStandke/ReinforcementLearning/blob/main/SpyTradingAgent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stock Trading Using Reinforcement Learning


This implementation closely follows Chapter 10 of the book
[Deep Reinforcement Learning Hands-On - Second Edition by Maxim Lapan](https://www.amazon.com/Deep-Reinforcement-Learning-Hands-optimization/dp/1838826998/ref=asc_df_1838826998/?tag=hyprod-20&linkCode=df0&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229&psc=1&tag=&ref=&adgrpid=93867144477&hvpone=&hvptwo=&hvadid=416741343328&hvpos=&hvnetw=g&hvrand=7234438034400691228&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9008183&hvtargid=pla-871456510229)

As stated in chapter 10: 

> Rather than learning new methods to solve toy reinforcement learning (RL) problems in this chapter, we will try to utilize our deep Q-network (DQN) knowledge to deal with the much more practical problem of financial trading. 

Namely, an RL agent receives some observation of the market and has to take an action: buy, sell, or hold. If the agent buys before the price goes up, its profit is positive; otherwise, it receives a negative reward. The agent tries to obtain as much profit as possible in the trading environment. 



In [1]:
!pip install ptan

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ptan
  Downloading ptan-0.7.tar.gz (20 kB)
Collecting torch==1.7.0
  Downloading torch-1.7.0-cp37-cp37m-manylinux1_x86_64.whl (776.7 MB)
     |████████████████████████████████| 776.7 MB 3.7 kB/s
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Building wheels for collected packages: ptan
  Building wheel for ptan (setup.py) ... done
  Created wheel for ptan: filename=ptan-0.7-py3-none-any.whl size=23505 sha256=c8de552d0cccfde12b797d6d25b1763c0075d052195555f7bd6ffb5c0d85d61c
  Stored in directory: /root/.cache/pip/wheels/60/72/3d/a3c47193fdb9efd08e3a54398af996b2989c68571813a71256
Successfully built ptan
Installing collected packages: dataclasses, torch, ptan
  Attempting uninstall: torch
    Found existing installation: torch 1.12.0+cu113
    Uninstalling torch-1.12.0+cu113:
      Successfully uninstalled torch-1.12.0+cu113
ERROR

In [2]:
!pip install tensorboardX

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorboardX
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
     |████████████████████████████████| 125 kB 14.4 MB/s
Installing collected packages: tensorboardX
Successfully installed tensorboardX-2.5.1


In [3]:
import pandas as pd
import numpy as np
import torch 
import torch.nn as nn
import torch.nn.functional as F
import gym
import gym.spaces
from gym.utils import seeding
from gym.envs.registration import EnvSpec
import enum
import glob
import os
import collections
import csv
import sys
import time

In [4]:
# method for validation runs
def validation_run(env, net, episodes=100, device="cpu", epsilon=0.02, comission=0.1):
    stats = {
        'episode_reward': [],
        'episode_steps': [],
        'order_profits': [],
        'order_steps': [],
    }

    for episode in range(episodes):
        obs = env.reset()

        total_reward = 0.0
        position = None
        position_steps = None
        episode_steps = 0

        while True:
            obs_v = torch.tensor([obs]).to(device)
            out_v = net(obs_v)

            action_idx = out_v.max(dim=1)[1].item()
            if np.random.random() < epsilon:
                action_idx = env.action_space.sample()
            action = Actions(action_idx)

            close_price = env._state._cur_close()

            if action == Actions.Buy and position is None:
                position = close_price
                position_steps = 0
            elif action == Actions.Close and position is not None:
                profit = close_price - position - (close_price + position) * comission / 100
                profit = 100.0 * profit / position
                stats['order_profits'].append(profit)
                stats['order_steps'].append(position_steps)
                position = None
                position_steps = None

            obs, reward, done, _ = env.step(action_idx)
            total_reward += reward
            episode_steps += 1
            if position_steps is not None:
                position_steps += 1
            if done:
                if position is not None:
                    profit = close_price - position - (close_price + position) * comission / 100
                    profit = 100.0 * profit / position
                    stats['order_profits'].append(profit)
                    stats['order_steps'].append(position_steps)
                break

        stats['episode_reward'].append(total_reward)
        stats['episode_steps'].append(episode_steps)

    return { key: np.mean(vals) for key, vals in stats.items() }
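
The order profit computed in `validation_run` can be checked by hand. The following is a minimal sketch of that formula in isolation; the function name `order_profit_percent` and the prices are hypothetical and chosen only for illustration.

```python
def order_profit_percent(open_price, close_price, commission_perc=0.1):
    """Net profit of one round trip, as a percentage of the entry price.

    Commission is paid on both legs, so it is applied to the sum of the
    entry and exit prices, as in validation_run.
    """
    fees = (close_price + open_price) * commission_perc / 100
    return 100.0 * (close_price - open_price - fees) / open_price


# Hypothetical prices, for illustration only.
profit = order_profit_percent(open_price=400.0, close_price=404.0)
flat = order_profit_percent(open_price=400.0, close_price=400.0)
print(round(profit, 3), round(flat, 3))
```

Note that a trade closed at the entry price still loses money, because commission is paid on both legs.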

In [3]:
# tracks the rewards of the trading agent
class RewardTracker:
    def __init__(self, writer, stop_reward, group_rewards=1):
        self.writer = writer
        self.stop_reward = stop_reward
        self.reward_buf = []
        self.steps_buf = []
        self.group_rewards = group_rewards

    def __enter__(self):
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        self.total_steps = []
        return self

    def __exit__(self, *args):
        self.writer.close()

    def reward(self, reward_steps, frame, epsilon=None):
        # gets the reward for num of steps agent took
        reward, steps = reward_steps
        # stores the reward in buffer
        self.reward_buf.append(reward)
        self.steps_buf.append(steps)
        if len(self.reward_buf) < self.group_rewards:
            return False
        # calculates mean reward from buffer 
        reward = np.mean(self.reward_buf)
        steps = np.mean(self.steps_buf)
        self.reward_buf.clear()
        self.steps_buf.clear()
        # stores the total reward
        self.total_rewards.append(reward)
        self.total_steps.append(steps)
        speed = (frame - self.ts_frame) / (time.time() - self.ts)
        self.ts_frame = frame
        self.ts = time.time()
        mean_reward = np.mean(self.total_rewards[-100:])  # mean over the last 100 grouped rewards
        mean_steps = np.mean(self.total_steps[-100:])
        epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
        # print("%d: done %d games, mean reward %.3f, mean steps %.2f, speed %.2f f/s%s" % (
        #     frame, len(self.total_rewards)*self.group_rewards, mean_reward, mean_steps, speed, epsilon_str
        # ))
        print("mean reward %.3f, mean steps %.2f"  % (mean_reward, mean_steps))
        sys.stdout.flush()
        if epsilon is not None:
            self.writer.add_scalar("epsilon", epsilon, frame)
        self.writer.add_scalar("speed", speed, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)
        self.writer.add_scalar("steps_100", mean_steps, frame)
        self.writer.add_scalar("steps", steps, frame)
        if mean_reward > self.stop_reward:
            print("Solved in %d frames!" % frame)
            return True
        return False


def calc_values_of_states(states, net, device="cpu"):
    mean_vals = []
    for batch in np.array_split(states, 64):
        states_v = torch.tensor(batch).to(device)
        action_values_v = net(states_v)
        best_action_values_v = action_values_v.max(1)[0]
        mean_vals.append(best_action_values_v.mean().item())
    return np.mean(mean_vals)


def unpack_batch(batch):
    states, actions, rewards, dones, last_states = [], [], [], [], []
    for exp in batch:
        state = np.array(exp.state, copy=False)
        states.append(state)
        actions.append(exp.action)
        rewards.append(exp.reward)
        dones.append(exp.last_state is None)
        if exp.last_state is None:
            last_states.append(state)       # the result will be masked anyway
        else:
            last_states.append(np.array(exp.last_state, copy=False))
    return np.array(states, copy=False), np.array(actions), np.array(rewards, dtype=np.float32), \
           np.array(dones, dtype=np.uint8), np.array(last_states, copy=False)


def calc_loss(batch, net, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = unpack_batch(batch)

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    # boolean mask: indexing with a uint8 (Byte) tensor is deprecated in PyTorch
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_actions = net(next_states_v).max(1)[1]
    next_state_values = tgt_net(next_states_v).gather(1, next_state_actions.unsqueeze(-1)).squeeze(-1)
    next_state_values[done_mask] = 0.0

    expected_state_action_values = next_state_values.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)
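
The target in `calc_loss` appears to follow the double DQN pattern: the online network picks the next action and the target network supplies that action's value. Below is a minimal pure-Python sketch for a single transition. The helper name `double_dqn_target` and the Q-values are made up for illustration and are not part of the notebook.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Bellman target for one transition, double DQN style."""
    if done:
        # terminal transition: nothing to bootstrap from
        return reward
    # the online network selects the action ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... and the target network evaluates it
    return reward + gamma * q_target_next[best_action]


# Made-up Q-values for three actions.
target = double_dqn_target(
    reward=1.0, done=False,
    q_online_next=[0.2, 0.9, 0.1],
    q_target_next=[0.5, 0.4, 0.8],
    gamma=0.5,
)
terminal_target = double_dqn_target(1.0, True, [0.2, 0.9, 0.1], [0.5, 0.4, 0.8], 0.5)
print(target, terminal_target)
```

The online network prefers action 1, so the target uses the target network's estimate for action 1 (0.4) rather than the target network's own maximum (0.8).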



# Price Data for Trading Environment

The chapter trains its reinforcement learning trading agent on Russian stock market prices for the technology company [Yandex](https://en.wikipedia.org/wiki/Yandex), covering 2015 to 2016. The dataset contains over 130,000 rows, where every row represents a single minute, and the price movement during that minute is captured by five variables: open, high, low, close, and volume. 

Rather than use one stock, I decided to use the basket of stocks held by the [SPY ETF](https://www.etf.com/SPY#:~:text=SPY%20is%20the%20best%2Drecognized,US%20index%2C%20the%20S%26P%20500.). This gives a longer trading horizon than the Yandex data provides. The period ranges from 2005 to 2022. Each row represents a single trading day, and the price movement during that day is captured by five variables: open, high, low, close, and volume. 

In [6]:
Prices = collections.namedtuple('Prices', field_names=['open', 'high', 'low', 'close', 'volume'])

def read_csv(file_name, sep=',', filter_data=True, fix_open_prices=False):
  print("Reading", file_name)
  with open(file_name, 'r') as fd:
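    reader = csv.reader(fd, delimiter=sep)
    # header row; named separately so it is not confused with the 'h' (high) list below
    header = next(reader)
    indices = [header.index(s) for s in ('Open', 'High', 'Low', 'Close', 'Volume')]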
    o, h, l, c, v = [], [], [], [], []
    count_out = 0
    count_filter = 0 
    count_fixed = 0
    prev_vals = None
    for row in reader:
      vals = list(map(float, [row[idx] for idx in indices])) 
      if filter_data and all(map(lambda v: abs(v-vals[0]) < 1e-8, vals[:-1])):
        count_filter += 1
        continue
      
      po, ph, pl, pc, pv = vals

      count_out +=1
      o.append(po)
      c.append(pc)
      h.append(ph)
      l.append(pl)
      v.append(pv)
      prev_vals = vals
  #print("Read done, got %d rows, %d filtered, %d open prices adjusted" % (count_filter+count_out, count_filter, count_fixed))
  return Prices(open=np.array(o, dtype=np.float32),high=np.array(h, dtype=np.float32), low=np.array(l, dtype=np.float32),close=np.array(c, dtype=np.float32), volume=np.array(v, dtype=np.float32))

# Key: agent learns relative movement, rather than actual price values
def prices_to_relative(prices):
    """
    Convert prices to relative in respect to open price
    :param prices: Prices tuple with open, high, low, close, volume
    :return: Prices tuple with open, rel_high, rel_low, rel_close, volume
    """
    assert isinstance(prices, Prices)
    rh = (prices.high - prices.open) / prices.open
    rl = (prices.low - prices.open) / prices.open
    rc = (prices.close - prices.open) / prices.open
    return Prices(open=prices.open, high=rh, low=rl, close=rc, volume=prices.volume)

def load_relative(csv_file):
    return prices_to_relative(read_csv(csv_file))
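
To make the relative encoding concrete, here is a small sketch for a single bar, written in plain Python rather than NumPy. The helper name `to_relative` and the prices are hypothetical; the last line mirrors how `_cur_close` later recovers the real close from the open and the relative close.

```python
def to_relative(open_price, high, low, close):
    """Express high, low and close as fractional offsets from the open."""
    return tuple((p - open_price) / open_price for p in (high, low, close))


# One hypothetical daily bar.
open_price = 200.0
rel_high, rel_low, rel_close = to_relative(open_price, high=204.0, low=198.0, close=201.0)

# Inverse step: real close from the open and the relative close.
recovered_close = open_price * (1.0 + rel_close)
print(rel_high, rel_low, rel_close, recovered_close)
```

Because every bar is expressed relative to its own open, the agent sees price movement rather than the absolute price level.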


# Creating the Action Space

In [7]:
# sets the actions trading agent can take when trading 
class Actions(enum.Enum):
  Nothing = 0
  Buy = 1
  Close = 2

# Creating Trading Environment 





In [48]:
# Key: number of past trading days the agent can observe; for the 1D
# convolution over the 2D state matrix this is the number of columns
DEFAULT_BARS_COUNT = 2
# percentage of stock price trading agent pays broker on buying/selling SPY. By default, it's 0.1%.
DEFAULT_COMMISSION_PERC = 0.1

class StocksEnv(gym.Env):
  # fields required by gym.Env
  metadata = {'render.modes': ['human']}
  spec = EnvSpec("SPYEnv-v0")

  # constructor of the environment
  def __init__(self, prices, bars_count=DEFAULT_BARS_COUNT, commission=DEFAULT_COMMISSION_PERC,
               reset_on_close=True, state_1d=False, random_ofs_on_reset=True,
               reward_on_close=False, volumes=True):
    # check to see stock prices is a dict data structure
    assert isinstance(prices, dict)
    self._prices = prices
    
    # important: creating the state object for the trading agent
    if state_1d:
      self._state = State1D(bars_count, commission, reset_on_close, reward_on_close=reward_on_close, volumes=volumes)
    else:
      self._state = State(bars_count, commission, reset_on_close, reward_on_close=reward_on_close, volumes=volumes)
    
    # creating discrete action space for trading agent
    self.action_space = gym.spaces.Discrete(n=len(Actions))
    
    # creating observation space for training agent
    # i.e. a (possibly unbounded) box in R^n. Specifically, a Box represents the 
    # Cartesian product of n closed intervals which in this case is (-inf, inf)
    self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=self._state.shape, dtype=np.float32)
    
    # if True, a random offset into the time series is chosen on every reset of the environment.
    # Otherwise, the episode starts from the beginning of the data.
    self.random_ofs_on_reset = random_ofs_on_reset
    self.seed()

  # important: creates the offset for time series data (i.e. not 
  # always starting at the beginning of the time series data)
  def reset(self):
    self._instrument = self.np_random.choice(list(self._prices.keys()))
    prices = self._prices[self._instrument]
    bars = self._state.bars_count
    if self.random_ofs_on_reset:
      offset = self.np_random.choice(prices.high.shape[0]-bars*10)+bars
    else:
      offset = bars
    self._state.reset(prices, offset)
    return self._state.encode()  

  # important: executes the sequence of agent taking action, getting reward and
  # then getting the next observation/state 
  def step(self, action_idx):
    action = Actions(action_idx)
    reward, done = self._state.step(action)
    obs = self._state.encode()
    info = {"instrument":self._instrument, "offset": self._state._offset}
    return obs, reward, done, info

  # methods required by gym.Env; in the future, render will be implemented to show
  # the agent's observation space on a trading chart while it trades
  def render(self, mode='human', close=False):
    pass
  def close(self):
    pass

  def seed(self, seed=None):
    self.np_random, seed1 = seeding.np_random(seed)
    seed2 = seeding.hash_seed(seed1+1) % 2**33
    return [seed1, seed2]

  # important: creates an instance of the  environment
  @classmethod
  def from_dir(cls, data_dir, **kwargs):
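    # price_files() is not defined in this notebook, so the CSV files are collected
    # directly (assumption: data_dir holds one CSV file per instrument)
    prices = {f: load_relative(f) for f in glob.glob(os.path.join(data_dir, "*.csv"))}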
    return StocksEnv(prices, **kwargs)
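
The random starting offset chosen in `reset` can be sketched on its own. The version below uses the standard library's `random.Random` in place of gym's `np_random`, and the helper name `pick_offset` is hypothetical; it only illustrates the range the offset falls in.

```python
import random


def pick_offset(n_rows, bars_count, rng, random_ofs=True):
    """Choose the bar index at which an episode starts."""
    if not random_ofs:
        # always start right after the first observable window
        return bars_count
    # randrange(k) draws from 0..k-1, like np_random.choice(k) in reset()
    return rng.randrange(n_rows - bars_count * 10) + bars_count


rng = random.Random(0)
offsets = [pick_offset(n_rows=1000, bars_count=2, rng=rng) for _ in range(5)]
fixed = pick_offset(n_rows=1000, bars_count=2, rng=rng, random_ofs=False)
print(offsets, fixed)
```

With 1000 rows and `bars_count=2`, the offset lies between 2 and 981 inclusive, which leaves room for the look-back window at the start and some bars to trade at the end.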

# Creating the State Space

In [9]:
class State:
  def __init__(self, bars_count, commission_perc, reset_on_close, reward_on_close=True, volumes=True):
    # checking bars_count is an int
    assert isinstance(bars_count, int)
    # checking that bars_count is greater than zero
    assert bars_count > 0
    # checking commission is a float
    assert isinstance(commission_perc, float)
    # checking commission is non-negative
    assert commission_perc >= 0.0
    # checking that reset_on_close and reward on close are bools
    assert isinstance(reset_on_close, bool)
    assert isinstance(reward_on_close, bool)
    self.bars_count=bars_count
    self.commission_perc = commission_perc
    self.reset_on_close = reset_on_close
    self.reward_on_close = reward_on_close
    self.volumes = volumes
  
  # method that resets the state for a new episode
  def reset(self, prices, offset):
    assert isinstance(prices, Prices)
    assert offset >= self.bars_count-1
    self.have_position = False
    self.open_price = 0.0
    self._prices = prices
    self._offset = offset

  # the shape of the state (i.e. 1D vector)
  @property
  def shape(self):
    # each bar contributes its high, low, and close prices (3 values, or 4
    # when volume is used); the vector holds bars_count such bars
    # (i.e. the past prices the agent can observe), plus the position flag
    # (i.e. whether the agent is holding the stock or not) and
    # the relative profit the agent has received since opening the position
    if self.volumes:
      return (4*self.bars_count+1+1, )
    else:
      return (3*self.bars_count+1+1, )
  
  # important: method that encodes the current state
  def encode(self):
    res = np.ndarray(shape=self.shape, dtype=np.float32)
    shift = 0
    for bar_idx in range(-self.bars_count+1, 1):
      res[shift] = self._prices.high[self._offset + bar_idx]
      shift += 1
      res[shift] = self._prices.low[self._offset + bar_idx]
      shift += 1
      res[shift] = self._prices.close[self._offset + bar_idx]
      shift += 1
      if self.volumes:
        res[shift] = self._prices.volume[self._offset + bar_idx]
        shift += 1
    res[shift] = float(self.have_position)
    shift += 1
    if not self.have_position:
      res[shift] = 0.0
    else:
      res[shift] = (self._cur_close() - self.open_price) / self.open_price
    return res
 
  def _cur_close(self):
    """
    Calculate real close price for the current bar
    """
    open = self._prices.open[self._offset]
    rel_close = self._prices.close[self._offset]
    return open * (1.0 + rel_close)

  # important: where the agent takes its action (i.e. buying or closing) based on the past price/state,
  # receives the reward for doing so, and the price offset is advanced
  def step(self, action):
    """
    Perform one step in our price, adjust offset, check for the end of prices
    and handle position change
    :param action:
    :return: reward, done
    """
    assert isinstance(action, Actions)
    reward = 0.0
    done = False
    close = self._cur_close()
    if action == Actions.Buy and not self.have_position:
      self.have_position = True
      self.open_price = close
      reward -= self.commission_perc
    elif action == Actions.Close and self.have_position:
      reward -= self.commission_perc
      done |= self.reset_on_close
      if self.reward_on_close:
        reward += 100.0 * (close - self.open_price) / self.open_price
      self.have_position = False
      self.open_price = 0.0

    self._offset += 1
    prev_close = close
    close = self._cur_close()
    done |= self._offset >= self._prices.close.shape[0]-1
    
    if self.have_position and not self.reward_on_close:
      reward += 100.0 * (close - prev_close) / prev_close
      
    return reward, done


class State1D(State):
    """
    State with shape suitable for 1D convolution
    """
    @property
    def shape(self):
        if self.volumes:
            return (6, self.bars_count)
        else:
            return (5, self.bars_count)

    def encode(self):
        res = np.zeros(shape=self.shape, dtype=np.float32)
        ofs = self.bars_count-1
        res[0] = self._prices.high[self._offset-ofs:self._offset+1]
        res[1] = self._prices.low[self._offset-ofs:self._offset+1]
        res[2] = self._prices.close[self._offset-ofs:self._offset+1]
        if self.volumes:
            res[3] = self._prices.volume[self._offset-ofs:self._offset+1]
            dst = 4
        else:
            dst = 3
        if self.have_position:
            res[dst] = 1.0
            res[dst+1] = (self._cur_close() - self.open_price) / self.open_price
        return res
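
The per-bar reward rule in `State.step` (the `reward_on_close=False` case) can be replayed on a short list of closing prices. This is a simplified sketch: the helper `step_rewards`, the string actions, and the prices are hypothetical, and episode termination (`reset_on_close`) is left out.

```python
def step_rewards(closes, actions, commission_perc=0.1):
    """Replay the per-bar reward rule over a list of closing prices.

    actions[i] is 'buy', 'close' or 'hold', taken at bar i; the price part of
    the reward uses the move from closes[i] to closes[i + 1].
    """
    have_position = False
    rewards = []
    for i, action in enumerate(actions):
        reward = 0.0
        if action == "buy" and not have_position:
            have_position = True
            reward -= commission_perc      # commission for opening
        elif action == "close" and have_position:
            have_position = False
            reward -= commission_perc      # commission for closing
        if have_position:
            # percentage move of the close price over the next bar
            reward += 100.0 * (closes[i + 1] - closes[i]) / closes[i]
        rewards.append(reward)
    return rewards


closes = [100.0, 102.0, 101.0, 103.0]   # hypothetical closing prices
rewards = step_rewards(closes, ["buy", "hold", "close"])
print([round(r, 3) for r in rewards])
```

While a position is open the agent is rewarded on every bar with that bar's percentage price move, and each buy or close costs the commission.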

# Creating the Dueling DQN Model with 1D Convolutions 


In [50]:
class DQNConv1D(nn.Module):
    def __init__(self, shape, actions_n):
        super(DQNConv1D, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 128, 1),
            nn.ReLU(),
            nn.Conv1d(128, 128, 1),
            nn.ReLU(),
        )

        out_size = self._get_conv_out(shape)

        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        return val + adv - adv.mean(dim=1, keepdim=True)
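
The last line of `forward` combines the two heads in the usual dueling fashion: Q = V + A - mean(A). A tiny pure-Python sketch with made-up numbers (the helper name `dueling_q_values` is hypothetical):

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages."""
    mean_adv = sum(advantages) / len(advantages)
    # subtracting the mean keeps the advantages centred around zero
    return [value + a - mean_adv for a in advantages]


# Made-up outputs for the three actions Nothing, Buy, Close.
q = dueling_q_values(value=2.0, advantages=[1.0, 0.0, -1.0])
q_shifted = dueling_q_values(value=2.0, advantages=[11.0, 10.0, 9.0])
print(q, q_shifted)
```

Adding a constant to every advantage leaves the Q-values unchanged, so the value head alone carries the overall level of the state.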



In [18]:
env = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True, volumes=True)
env.observation_space.shape

(6, 2)
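
The `(6, 2)` shape printed above follows from the `shape` properties of `State1D` and `State`. A short sketch of the arithmetic, using hypothetical helper names:

```python
def flat_state_size(bars_count, volumes=True):
    """Length of the flat State vector."""
    per_bar = 4 if volumes else 3      # high, low, close (+ volume)
    return per_bar * bars_count + 2    # + position flag + relative profit


def conv_state_shape(bars_count, volumes=True):
    """(channels, bars) shape of the State1D matrix."""
    channels = 6 if volumes else 5     # price rows + position flag + profit
    return (channels, bars_count)


print(flat_state_size(2), conv_state_shape(2), conv_state_shape(2, volumes=False))
```

With `bars_count=2` and volumes enabled, the flat state has 10 entries, while the convolutional state has 6 channels of 2 bars each.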

# Training the Trading Agent

In [4]:
#!/usr/bin/env python3
import os
import gym
import ptan
import argparse
import numpy as np

import torch
import torch.optim as optim

from tensorboardX import SummaryWriter

BATCH_SIZE = 32
BARS_COUNT = 2
TARGET_NET_SYNC = 1000
GAMMA = 0.99
REPLAY_SIZE = 100000
REPLAY_INITIAL = 10000
REWARD_STEPS = 2
LEARNING_RATE = 0.0001
STATES_TO_EVALUATE = 1000
EVAL_EVERY_STEP = 1000
EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1000000
CHECKPOINT_EVERY_STEP = 1000000
VALIDATION_EVERY_STEP = 100000
#------------------------------------------------------------------------#
CUDA = True
DEFAULT_STOCKS = "/content/drive/MyDrive/Datasets/SPY/spy_past.csv"
DEFAULT_VAL_STOCKS = "/content/drive/MyDrive/Datasets/SPY/spy_future.csv"
YEAR = None
SAVE_PATH = "saves"

if __name__ == "__main__":
    # fall back to the CPU when no GPU is available
    device = torch.device("cuda" if CUDA and torch.cuda.is_available() else "cpu")
    saves_path = os.path.join("/content/", SAVE_PATH)
    os.makedirs(saves_path, exist_ok=True)

    if YEAR is not None or os.path.isfile(DEFAULT_STOCKS):
        if YEAR is not None:
            stock_data = data.load_year_data(YEAR)
        else:
            stock_data = {"SPY": load_relative(DEFAULT_STOCKS)}
        env = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True, volumes=True)
        env_tst = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True)
    elif os.path.isdir(DEFAULT_STOCKS):
        env = StocksEnv.from_dir(DEFAULT_STOCKS, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
        env_tst = StocksEnv.from_dir(DEFAULT_STOCKS, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
    else:
        raise RuntimeError("No data to train on")
    env = gym.wrappers.TimeLimit(env, max_episode_steps=1000)
    
    val_data = {"SPY": load_relative(DEFAULT_VAL_STOCKS)}
    env_val = StocksEnv(val_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=True)

    writer = SummaryWriter(comment="-simple-" + "run")
    net = DQNConv1D(env.observation_space.shape, env.action_space.n).to(device)
    tgt_net = ptan.agent.TargetNet(net)
    selector = ptan.actions.EpsilonGreedyActionSelector(EPSILON_START)
    agent = ptan.agent.DQNAgent(net, selector, device=device)
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count=REWARD_STEPS)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

    # main training loop
    step_idx = 0
    eval_states = None
    best_mean_val = None

    with RewardTracker(writer, np.inf, group_rewards=100) as reward_tracker:
        while True:
            step_idx += 1
            buffer.populate(1)
            selector.epsilon = max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)

            new_rewards = exp_source.pop_rewards_steps()
            if new_rewards:
                reward_tracker.reward(new_rewards[0], step_idx, selector.epsilon)

            if len(buffer) < REPLAY_INITIAL:
                continue

            if eval_states is None:
                print("Initial buffer populated, start training")
                eval_states = buffer.sample(STATES_TO_EVALUATE)
                eval_states = [np.array(transition.state, copy=False) for transition in eval_states]
                eval_states = np.array(eval_states, copy=False)

            if step_idx % EVAL_EVERY_STEP == 0:
                mean_val = calc_values_of_states(eval_states, net, device=device)
                writer.add_scalar("values_mean", mean_val, step_idx)
                if best_mean_val is None or best_mean_val < mean_val:
                    if best_mean_val is not None:
                        print("%d: Best mean value updated %.3f -> %.3f" % (step_idx, best_mean_val, mean_val))
                    best_mean_val = mean_val
                    torch.save(net.state_dict(), os.path.join(saves_path, "mean_val-%.3f.data" % mean_val))

            optimizer.zero_grad()
            batch = buffer.sample(BATCH_SIZE)
            loss_v = calc_loss(batch, net, tgt_net.target_model, GAMMA ** REWARD_STEPS, device=device)
            loss_v.backward()
            optimizer.step()

            if step_idx % TARGET_NET_SYNC == 0:
                tgt_net.sync()

            if step_idx % CHECKPOINT_EVERY_STEP == 0:
                idx = step_idx // CHECKPOINT_EVERY_STEP
                torch.save(net.state_dict(), os.path.join(saves_path, "checkpoint-%03d.data" % idx))

            if step_idx % VALIDATION_EVERY_STEP == 0:
                # res = validation_run(env_tst, net, device=device)
                # for key, val in res.items():
                #     writer.add_scalar(key + "_test", val, step_idx)
                res = validation_run(env_val, net, device=device)
                for key, val in res.items():
                    writer.add_scalar(key + "_val", val, step_idx)

ModuleNotFoundError: ignored
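
Two of the hyperparameter expressions in the training loop are easy to check in isolation: the linear epsilon decay and the discount passed to `calc_loss`. The sketch below reuses the constants from the cell above; the helper name `epsilon_at` is hypothetical.

```python
EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1_000_000
GAMMA = 0.99
REWARD_STEPS = 2


def epsilon_at(step_idx):
    """Linearly decayed exploration rate, clipped at EPSILON_STOP."""
    return max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)


# With 2-step transitions, the bootstrapped value is discounted by gamma squared.
n_step_discount = GAMMA ** REWARD_STEPS

print(epsilon_at(0), epsilon_at(500_000), epsilon_at(2_000_000), n_step_discount)
```

Epsilon falls from 1.0 to its floor of 0.1 over roughly the first 900,000 steps and stays there afterwards.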

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs/

In [None]:
!zip -r /content/runs.zip /content/runs/