<a href="https://colab.research.google.com/github/letianzj/QuantResearch/blob/master/ml/reinforcement_trader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(Use the Open in Colab button above to see the trading videos.)

## Introduction

From reinforcement gamer to reinforcement trader. Part II.

For the reinforcement gamer, Part I, check out [the previous notebook](https://github.com/letianzj/QuantResearch/blob/master/ml/atari_space_invaders.ipynb). This notebook closely resembles the previous one.

As illustrated in the figure below, investing bears a clear resemblance to game playing. In fact, some good poker players, such as Edward Thorp, also stand out in the stock markets.


![From Game Player to Stock Trader](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MSGPuOMqasmUECLHyXj%2Fuploads%2Fgit-blob-87626e4bd747bdb40439277c09abce3e5aeb822d%2Fch5_rl_stock_trading.PNG?alt=media)

source: [Chapter Machine Learning](https://letianzj.gitbook.io/systematic-investing/products_and_methodologies/machine_learning)

Reinforcement learning has been applied to stock trading and portfolio management. Xiong, Zhuoran, et al. (2018) explore the stock market and Zhang, et al. (2020) trade the futures market. Nan, et al. (2020) add news headline sentiment to the training. Spooner, Thomas, et al. (2018) study market makers who face inventory risk. Fischer, T. G. (2018) provides a survey of the current state of reinforcement learning in financial markets.

This notebook focuses on the trading part. It trains a reinforcement trader to buy and sell stocks. The objective is to maximize the ending dollar profit. Of course, other risk-adjusted objectives such as the Sharpe ratio are also viable.

The next notebook will focus on the portfolio management part, by training a reinforcement portfolio manager to perform strategic allocation among stocks or asset classes.

## Setup

Uncomment and execute once.

In [8]:
# !sudo apt-get update
# !pip install yfinance
# !pip install ta
# !pip install -U gym==0.21.0
# !pip install -U quanttrader==0.5.5
# !pip install -U pyfolio==0.9.2

# !sudo apt-get install -y xvfb ffmpeg freeglut3-dev
# !pip install 'imageio==2.4.0'
# !pip install pyvirtualdisplay
# !pip install tf-agents[reverb]
# !pip install pyglet
#!pip install -U PyYaml==3.13
!pip install quanttrader
!pip install pyfolio


Restart the runtime so that PyYaml==3.13 takes effect. Otherwise pyfolio will raise a yaml.load error.

The code below might need to be run twice.

In [15]:
!pip install tf-agents
!pip install pyvirtualdisplay

import os
import io
import tempfile
import shutil
import zipfile
from google.colab import files

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf
import gym
import quanttrader as qt
from quanttrader import TradingEnv
import pyfolio as pf

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import py_driver
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.environments import tf_py_environment
from tf_agents.environments import suite_gym
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential, q_network, network
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import policy_saver
from tf_agents.replay_buffers import TFUniformReplayBuffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

import base64
import imageio
import IPython
import matplotlib
import PIL.Image
import pyvirtualdisplay




In [16]:
gym.__version__, qt.__version__, pf.__version__

('0.25.2', '0.5.5', '0.9.2')

In [28]:
!curl -L http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz -O && tar xzvf ta-lib-0.4.0-src.tar.gz
!cd ta-lib && ./configure --prefix=/usr && make && make install && cd - && pip install ta-lib
!pip install ta
def load_data():
    from datetime import timedelta
    import ta


    start_date = datetime(2010, 1, 1)
    end_date = datetime(2020, 12, 31)
    syms = ['SPY']
    max_price_scaler = 1                # prices are left unscaled
    max_volume_scaler = 1.5e8
    df_obs = pd.DataFrame()             # observation
    df_exch = pd.DataFrame()            # exchange; for order match

    for sym in syms:
        df = yf.download(sym, start=start_date, end=end_date)
        df.index = pd.to_datetime(df.index) + timedelta(hours=15, minutes=59, seconds=59)

        df_exch = pd.concat([df_exch, df['Close'].rename(sym)], axis=1)

        df['Open'] = df['Adj Close'] / df['Close'] * df['Open'] / max_price_scaler
        df['High'] = df['Adj Close'] / df['Close'] * df['High'] / max_price_scaler
        df['Low'] = df['Adj Close'] / df['Close'] * df['Low'] / max_price_scaler
        df['Volume'] = df['Adj Close'] / df['Close'] * df['Volume'] / max_volume_scaler
        df['Close'] = df['Adj Close'] / max_price_scaler
        df = df[['Open', 'High', 'Low', 'Close', 'Volume']]
        df.columns = [f'{sym}:{c.lower()}' for c in df.columns]

        macd = ta.trend.MACD(close=df[f'{sym}:close'])
        df[f'{sym}:macd'] = macd.macd()
        df[f'{sym}:macd_diff'] = macd.macd_diff()
        df[f'{sym}:macd_signal'] = macd.macd_signal()

        rsi = ta.momentum.RSIIndicator(close=df[f'{sym}:close'])
        df[f'{sym}:rsi'] = rsi.rsi()

        bb = ta.volatility.BollingerBands(close=df[f'{sym}:close'], window=20, window_dev=2)
        df[f'{sym}:bb_bbm'] = bb.bollinger_mavg()
        df[f'{sym}:bb_bbh'] = bb.bollinger_hband()
        df[f'{sym}:bb_bbl'] = bb.bollinger_lband()

        atr = ta.volatility.AverageTrueRange(high=df[f'{sym}:high'], low=df[f'{sym}:low'], close=df[f'{sym}:close'])
        df[f'{sym}:atr'] = atr.average_true_range()

        df_obs = pd.concat([df_obs, df], axis=1)

    return df_obs, df_exch


In [29]:
df_obs, df_exch = load_data()



## Trading Environment

In [30]:
look_back = 10
cash = 100_000.0
max_nav_scaler = cash

train_qt_env = TradingEnv(2, df_obs, df_exch)
train_qt_env.set_cash(cash)
train_qt_env.set_commission(0.0001)
train_qt_env.set_steps(n_lookback=10, n_warmup=50, n_maxsteps=250)
train_qt_env.set_feature_scaling(max_nav_scaler)

eval_qt_env = TradingEnv(2, df_obs, df_exch)
eval_qt_env.set_cash(cash)
eval_qt_env.set_commission(0.0001)
eval_qt_env.set_steps(n_lookback=10, n_warmup=50, n_maxsteps=2000, n_init_step=504)         # index 504 is 2012-01-03
eval_qt_env.set_feature_scaling(max_nav_scaler)

Take one step to see how the environment works.

In [32]:
o1 = eval_qt_env.reset()
total_reward = 0.0
while True:
    #action = eval_qt_env.action_space.sample()
    action = 1
    o2, reward, done, info = eval_qt_env.step(action)
    total_reward += reward
    #print(action, reward * max_nav_scaler, info)
    #if done:
    #  break
    break


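Each observation covers a lookback of 10 days (`n_lookback=10`) with 14 features per day: the 13 market columns built in `load_data` plus one extra column added by the environment (likely the scaled NAV, given `set_feature_scaling`).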

In [33]:
o1.shape, o2.shape

((10, 14), (10, 14))

In [34]:
idx0 = eval_qt_env._init_step
idx1 = idx0+3
eval_qt_env._df_obs_scaled[idx0:idx1]         # observation

| Date | SPY:open | SPY:high | SPY:low | SPY:close | SPY:volume | SPY:macd | SPY:macd_diff | SPY:macd_signal | SPY:rsi | SPY:bb_bbm | SPY:bb_bbh | SPY:bb_bbl | SPY:atr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-01-03 15:59:59 | 102.667899 | 103.166133 | 102.40271 | 102.458961 | 1.037704 | 0.861866 | 0.27725 | 0.584616 | 60.32441 | 99.863707 | 103.092144 | 96.63527 | 1.629781 |
| 2012-01-04 15:59:59 | 102.217875 | 102.708071 | 101.824113 | 102.619675 | 0.68138 | 0.966258 | 0.305314 | 0.660945 | 60.780454 | 99.955135 | 103.380981 | 96.529288 | 1.576508 |
| 2012-01-05 15:59:59 | 102.06519 | 103.045577 | 101.599101 | 102.892891 | 0.931613 | 1.058831 | 0.318309 | 0.740522 | 61.588768 | 100.058625 | 103.701289 | 96.415962 | 1.56722 |


In [35]:
eval_qt_env._df_exch[idx0:idx1]

| Date | SPY |
|---|---|
| 2012-01-03 15:59:59 | 127.5 |
| 2012-01-04 15:59:59 | 127.699997 |
| 2012-01-05 15:59:59 | 128.039993 |


At the end of 2012-01-03, if action = 1 (go all in on SPY), we can afford 100_000/127.50 ≈ 784.3 shares, which is rounded down to 784 shares. The commission is 784 × 127.50 × 0.0001 = 9.996, and the remaining cash is 100_000 − 784 × 127.50 − 9.996 = 30.004.

The market then moves to 2012-01-04, and the SPY price rises to 127.70. Our 784 shares are now worth 784 × 127.70 = 100_116.80, so the NAV including cash becomes 100_116.80 + 30.004 ≈ 100_146.80.

The reward is the change in NAV, in this case 146.80.

As shown below.

In [36]:
eval_qt_env._df_positions.iloc[idx0]

SPY          0.0
Cash    100000.0
NAV     100000.0
Name: 2012-01-03 15:59:59, dtype: float64

In [37]:
eval_qt_env._df_positions.iloc[idx0+1]

SPY        784.000000
Cash        30.004000
NAV     100146.801607
Name: 2012-01-04 15:59:59, dtype: float64

In [38]:
reward,  100146.801607-100000.0

(146.801607421875, 146.80160700000124)
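The arithmetic of this single step can be reproduced outside the environment. The sketch below is only an illustration: the variable names are hypothetical, the share count is assumed to be rounded down to whole shares (consistent with the 784 shares shown above, though the environment's exact rule is not shown here), and the 2012-01-04 price is rounded to 127.70.

```python
import math

cash = 100_000.0
commission_rate = 0.0001
price_t0 = 127.50   # SPY close on 2012-01-03
price_t1 = 127.70   # SPY close on 2012-01-04, rounded from 127.699997

# Assumption: only whole shares are bought, rounded down.
shares = math.floor(cash / price_t0)
commission = shares * price_t0 * commission_rate
cash_left = cash - shares * price_t0 - commission

# NAV after the market moves; the reward is the NAV change.
nav_t1 = shares * price_t1 + cash_left
reward = nav_t1 - cash

print(shares, round(commission, 3), round(cash_left, 3), round(reward, 2))
```

The small gap between this reward and the environment's 146.8016 comes from rounding the second-day price.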

Create TF-Agents environment from Gym environment.

In [39]:
train_qt_env = gym.wrappers.FlattenObservation(train_qt_env)
train_py_env = suite_gym.wrap_env(train_qt_env)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)

eval_qt_env = gym.wrappers.FlattenObservation(eval_qt_env)
eval_py_env = suite_gym.wrap_env(eval_qt_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

In [40]:
train_env.action_spec()

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1))

In [41]:
train_env.time_step_spec()

TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': BoundedTensorSpec(shape=(140,), dtype=tf.float32, name='observation', minimum=array(-3.4028235e+38, dtype=float32), maximum=array(3.4028235e+38, dtype=float32)),
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})
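The observation shape `(140,)` in the spec above follows from flattening the `(10, 14)` observation window. A minimal NumPy sketch, assuming a row-major flatten (NumPy's default order) and using placeholder values rather than real market data:

```python
import numpy as np

n_lookback, n_features = 10, 14

# Placeholder observation: one row per lookback day, one column per feature.
obs = np.arange(n_lookback * n_features, dtype=np.float32).reshape(n_lookback, n_features)

# Row-major flatten: day 0's features first, then day 1's, and so on.
flat = obs.reshape(-1)

print(obs.shape, flat.shape)
```

Under this layout, feature `j` of day `i` sits at index `i * 14 + j` of the flattened vector.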

Some helper functions

In [42]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)

def create_policy_eval_video(env, policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = env.reset()
      video.append_data(env.pyenv.envs[0].render())

      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = env.step(action_step.action)
        video.append_data(env.pyenv.envs[0].render())

  return embed_mp4(filename)

## Spontaneous Trader

In [43]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())

In [44]:
time_step = train_py_env.reset()

In [45]:
random_policy.action_spec

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1))

In [46]:
time_step = train_env.reset()
action_step = random_policy.action(time_step)

The video below shows the spontaneous trader's random trading behavior.

The upper half is the SPY price curve along with red buy marks and green sell marks. The lower half is the NAV, or total asset value.

Because both the trading window and the trading actions are random, each re-run of the code below generates a slightly different video.

In [47]:
create_policy_eval_video(train_env, random_policy, "random-agent", num_episodes=1)


In [None]:
train_env.pyenv.envs[0].env._df_positions

## Reinforcement Trader

### Recruit a Trader

Here we recruit a DQN trader, give her $100,000, and let her trade SPY.

Hopefully, after 1 million iterations of simulated training, she is able to find a good quantitative trading rule for SPY.

Her trading rule is a black box. We don't care how she trades, as long as she keeps bringing in profits.

She is then expected to apply her deep neural network trading rule to other stocks via so-called transfer learning.

In [48]:
learning_rate = 1e-3
num_eval_episodes = 10
replay_buffer_max_length = 100000

In [49]:
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(train_env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])

In [50]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

Here we use a three-layer fully connected deep neural network.

The input observation is 10 days x 14 features, flattened to shape 140. The first layer has 100 neurons. Therefore, it requires $140 \times 100+100=14,100$ parameters.

The second layer has 50 neurons, implying $100 \times 50 + 50 = 5,050$ parameters.

The output layer produces one Q-value for each of the two actions (buy or sell), which needs $50 \times 2 + 2 = 102$ parameters.
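The three counts above can be checked with a few lines of plain Python. This is only a sketch: `dense_params` is a hypothetical helper, not part of TF-Agents or Keras, and it assumes each dense layer has one weight per input-output pair plus one bias per output unit.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

# Flattened input, two hidden layers, and one output unit per action.
layer_sizes = [140, 100, 50, 2]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts, sum(counts))
```

The total of 19,252 agrees with the `q_net.summary()` output.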

In [51]:
q_net.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               multiple                  14100     
                                                                 
 dense_1 (Dense)             multiple                  5050      
                                                                 
 dense_2 (Dense)             multiple                  102       
                                                                 
Total params: 19252 (75.20 KB)
Trainable params: 19252 (75.20 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [52]:
eval_policy = agent.policy
collect_policy = agent.collect_policy

In [53]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())

In [54]:
time_step = train_env.reset()

In [55]:
random_policy.action(time_step)

PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>, state=(), info=())

In [56]:
def compute_avg_return(environment, policy, num_episodes=5):
  """Returns the average episode return and the action counts (zeros, ones) of the last episode."""
  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0
    zeros = 0
    ones = 0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      if action_step.action.numpy()[0] == 1:
        ones+=1
      else:
        zeros+=1
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0], zeros, ones

In [57]:
compute_avg_return(eval_env, random_policy, num_episodes=1)

(95933.625, 1003, 997)

In [58]:
replay_buffer = TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)


In [None]:
# replay_buffer_observer = replay_buffer.add_batch

In [59]:
train_env.reset()

init_driver = DynamicStepDriver(
    train_env,
    random_policy,
    observers=[replay_buffer.add_batch],
    num_steps=2_500)
final_time_step, final_policy_state = init_driver.run()

In [60]:
trajectories, buffer_info = replay_buffer.get_next(sample_batch_size=2, num_steps=3)


In [61]:
trajectories.observation.shape

TensorShape([2, 3, 140])

In [62]:
from tf_agents.trajectories.trajectory import to_transition
time_steps, action_steps, next_time_steps = to_transition(trajectories)
time_steps.observation.shape

TensorShape([2, 2, 140])

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (`num_steps=2`).

In [63]:
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=64,
    num_steps=2).prefetch(3)


In [64]:
dataset

<_PrefetchDataset element_spec=(Trajectory(
{'action': TensorSpec(shape=(64, 2), dtype=tf.int64, name=None),
 'discount': TensorSpec(shape=(64, 2), dtype=tf.float32, name=None),
 'next_step_type': TensorSpec(shape=(64, 2), dtype=tf.int32, name=None),
 'observation': TensorSpec(shape=(64, 2, 140), dtype=tf.float32, name=None),
 'policy_info': (),
 'reward': TensorSpec(shape=(64, 2), dtype=tf.float32, name=None),
 'step_type': TensorSpec(shape=(64, 2), dtype=tf.int32, name=None)}), BufferInfo(ids=TensorSpec(shape=(64, 2), dtype=tf.int64, name=None), probabilities=TensorSpec(shape=(64,), dtype=tf.float32, name=None)))>
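The pairing of adjacent rows described above can be illustrated with plain NumPy. The sketch below is not the TF-Agents implementation; it only mimics the slicing that turns a sampled trajectory of `num_steps` rows into `num_steps - 1` (current, next) observation pairs, using random placeholder data.

```python
import numpy as np

batch_size, num_steps, obs_dim = 2, 3, 140

# Placeholder batch of sampled trajectories: [batch, time, observation].
observations = np.random.rand(batch_size, num_steps, obs_dim)

# Current observations drop the last step; next observations drop the first.
current_obs = observations[:, :-1]
next_obs = observations[:, 1:]

print(current_obs.shape, next_obs.shape)
```

With `num_steps=3` this yields two transitions per batch item, matching the `[2, 2, 140]` shape seen after `to_transition`; with `num_steps=2`, as in the dataset above, each batch item yields exactly one transition.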

In [65]:
iterator = iter(dataset)

In [66]:
num_iterations = 1_000_000   # less intelligence, more persistence; 24x7 player
save_interval = 100_000
eval_interval = 50_000
log_interval = 5_000

In [67]:
# Create a driver to collect experience.
collect_driver = DynamicStepDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=4) # collect 4 steps for each training iteration

### Training the Trader

In [68]:
# (Optional) Optimize by wrapping some of the code in a graph using TF function.
collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)

# Reset the train step.
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(train_env, agent.policy, num_episodes=1)[0]
returns = np.array([avg_return])

# Reset the environment.
time_step = None
policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)

In [71]:
num_iterations = 1000
while True:
    # Collect a few steps using collect_policy and save to the replay buffer.
    time_step, policy_state = collect_driver.run(time_step, policy_state)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()
    print(f'\r step {step}', end='')

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(train_env, agent.policy, num_episodes=1)[0]
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns = np.append(returns, avg_return)

    # if step % save_interval == 0:
    #     save_checkpoint_to_local()

    if step > num_iterations:
        break

 step 54342

### Save the Trader Model

The policy is saved to GitHub. It can be fetched from GitHub with wget and then unzipped.

In [72]:
def create_zip_file(dirname, base_filename):
  return shutil.make_archive(base_filename, 'zip', dirname)

In [73]:
tempdir = os.getenv("TEST_TMPDIR", tempfile.gettempdir())

policy_dir = os.path.join(tempdir, 'policy')
tf_policy_saver = policy_saver.PolicySaver(agent.policy)

In [74]:
tf_policy_saver.save(policy_dir)
policy_zip_filename = create_zip_file(policy_dir, os.path.join(tempdir, 'exported_policy'))



In [75]:
!ls /tmp/exported_policy.zip

/tmp/exported_policy.zip


In [76]:
files.download(policy_zip_filename)


Export checkpoint

In [77]:
checkpoint_dir = os.path.join(tempdir, 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    replay_buffer=replay_buffer,
    global_step=train_step_counter
)

In [78]:
train_checkpointer.save(train_step_counter)

In [79]:
# !rm -rf /tmp/checkpoint
!ls /tmp/checkpoint -l

total 58128
-rw-r--r-- 1 root root      174 Nov  4 06:52 checkpoint
-rw-r--r-- 1 root root 59514118 Nov  4 06:52 ckpt-54342.data-00000-of-00001
-rw-r--r-- 1 root root     2462 Nov  4 06:52 ckpt-54342.index


In [80]:
checkpoint_zip_filename = create_zip_file(checkpoint_dir, os.path.join(tempdir, 'exported_cp'))

In [81]:
files.download(checkpoint_zip_filename)


Restore the saved policy

In [82]:
saved_policy = tf.saved_model.load(policy_dir)

### Evaluate Trader Performance

In [84]:
create_policy_eval_video(eval_env, agent.policy, "trained-agent", num_episodes=1)


In [85]:
compute_avg_return(eval_env, random_policy, num_episodes=1)

x_left = eval_env.pyenv.envs[0].env._init_step
x_right = eval_env.pyenv.envs[0].env._current_step     # _maxsteps+1
df_price = eval_env.pyenv.envs[0].env._df_exch[x_left:x_right].copy()
df_spontaneous = eval_env.pyenv.envs[0].env._df_positions['NAV'][x_left:x_right].copy()

In [86]:
print(compute_avg_return(eval_env, saved_policy, num_episodes=1))
df_reinforcement = eval_env.pyenv.envs[0].env._df_positions['NAV'][x_left:x_right].copy()

(148808.81, 0, 2000)


In [87]:
df_all = pd.concat([df_spontaneous, df_reinforcement, df_price], axis=1)
df_all.columns = ['spontaneous', 'tf-agent', 'benchmark']
df_ret = df_all / df_all.shift(1) - 1
df_ret = df_ret[1:]
random_perf_stats = pf.timeseries.perf_stats(df_ret['spontaneous'])
agent_perf_stats = pf.timeseries.perf_stats(df_ret['tf-agent'])
benchmark_perf_stats = pf.timeseries.perf_stats(df_ret['benchmark'])
perf_stats = pd.concat([random_perf_stats, agent_perf_stats, benchmark_perf_stats], axis=1)
perf_stats.columns = ['random', 'tf-agent', 'benchmark']
perf_stats


| Metric | random | tf-agent | benchmark |
|---|---|---|---|
| Annual return | 0.007252 | 0.121686 | 0.121725 |
| Cumulative returns | 0.058992 | 1.486599 | 1.487294 |
| Annual volatility | 0.095888 | 0.12896 | 0.128985 |
| Sharpe ratio | 0.123372 | 0.955222 | 0.955339 |
| Calmar ratio | 0.033012 | 0.603126 | 0.603244 |
| Stability | 0.152883 | 0.950936 | 0.950932 |
| Max drawdown | -0.219672 | -0.201759 | -0.201785 |
| Omega ratio | 1.03081 | 1.185775 | 1.185798 |
| Sortino ratio | 0.168937 | 1.345137 | 1.345309 |
| Skew | -0.454246 | -0.401827 | -0.401799 |


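The reinforcement trader outperforms the random spontaneous trader (annual return 12.17% vs 0.73%). In this run she chooses action 1 (fully invested) at all 2,000 evaluation steps, so her results are nearly identical to buy-and-hold: the annual return is about 12.17% for both, and the Sharpe ratio is about 0.955 for both.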
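For reference, the Sharpe ratio compared above can be approximated directly from a daily return series. The sketch below is a simplified version under stated assumptions: a zero risk-free rate, 252 trading days per year, and the sample standard deviation. `annualized_sharpe` is a hypothetical helper and the returns are synthetic, so the printed value says nothing about the strategies in this notebook.

```python
import numpy as np

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized mean/std ratio of daily returns (risk-free rate assumed zero)."""
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

# Synthetic daily returns, for illustration only.
rng = np.random.default_rng(0)
synthetic_returns = rng.normal(loc=0.0005, scale=0.01, size=2000)

print(annualized_sharpe(synthetic_returns))
```

Applied to the columns of `df_ret`, a function like this should give values close to the pyfolio figures in the table, up to differences in conventions.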