# NeurIPS2018_SB3

This is a modification version based on the first notebook from [FinRL-tutorial](https://github.com/AI4Finance-Foundation/FinRL-Tutorials)

https://github.com/AI4Finance-Foundation/FinRL-Tutorials/blob/master/1-Introduction/Stock_NeurIPS2018_SB3.ipynb

## Part 1. Task Discription
DRL agent training for cryptocurrency trading. 

This task is modeled as a Markov Decision Process (MDP), and the objective function is maximizing (expected) cumulative return.

We specify the state-action-reward as follows:

* **State s**: The state space represents an agent's perception of the market environment. Just like a human trader analyzing various information, here our agent passively observes many features and learns by interacting with the market environment (usually by replaying historical data).

* **Action a**: The action space includes allowed actions that an agent can take at each state. For example, a ∈ {−1, 0, 1}, where −1, 0, 1 represent

selling, holding, and buying. When an action operates single crypto a ∈{−k, ..., −1, 0, 1, ..., k}, e.g.. "Buy 10 units of BTC" or "Sell 10 units of BTC" are 10 or −10, respectively

* **Reward function r(s, a, s′)**: Reward is an incentive for an agent to learn a better policy. For example, it can be the change of the portfolio value when taking a at state s and arriving at new state s', i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio values at state s′ and s, respectively

**Market environment**: Cryptocurrencies from Binance

##  Part 2. Import Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# matplotlib.use('Agg')
import datetime, sqlite3, zipfile, os

%matplotlib inline
# from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
from finrl.agents.stablebaselines3.models import DRLAgent
from stable_baselines3.common.logger import configure
from finrl.meta.data_processor import DataProcessor

from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline
from pprint import pprint

import sys
sys.path.append("../FinRL")

import itertools



### Create Folders

In [2]:
from finrl import config
from finrl import config_tickers
import os
from finrl.main import check_and_make_directories
from finrl.config import (
    TRAINED_MODEL_DIR,
    TENSORBOARD_LOG_DIR,
    RESULTS_DIR,
    INDICATORS,
    TRAIN_START_DATE,
    TRAIN_END_DATE,
    TEST_START_DATE,
    TEST_END_DATE,
    TRADE_START_DATE,
    TRADE_END_DATE,
)
check_and_make_directories([TRAINED_MODEL_DIR, TENSORBOARD_LOG_DIR, RESULTS_DIR])

### Read Data

Read data from `binance-public-data`

In [3]:
# List of symbols to merge
symbols = ['BTCUSDT']

# List to store individual DataFrames
dfs = []

# Loop through each symbol
for symbol in symbols:
    directory = f'../mdt_utils/binance-public-data/python/data/spot/monthly/klines/{symbol}/1d/2020-01-01_2023-07-01/'
    
    # Loop through each zip file in the directory
    for file_name in os.listdir(directory):
        if file_name.endswith('.zip'):
            with zipfile.ZipFile(os.path.join(directory, file_name), 'r') as zip_ref:
                # only one CSV file in each zip archive
                csv_file = zip_ref.namelist()[0]
                with zip_ref.open(csv_file) as csv_fp:
                    # Read the CSV data into a DataFrame
                    temp_df = pd.read_csv(csv_fp, header=None)
                    temp_df.columns = ['open_time', 'open', 'high', 'low', 'close', 'volume', 'close_time', 'quote_asset_volume', 'number_of_trades', 'taker_buy_base_asset_volume', 'taker_buy_quote_asset_volume', 'ignore']
                    temp_df['date'] = pd.to_datetime(temp_df['close_time'], unit='ms').dt.strftime('%Y-%m-%d')
                    temp_df['day'] = (pd.to_datetime(temp_df['date']) - pd.to_datetime(temp_df['date'].iloc[0])).dt.days
                    temp_df['tic'] = symbol
                    dfs.append(temp_df[['date', 'open', 'high', 'low', 'close', 'volume', 'tic', 'day']])

# Concatenate all DataFrames into a single DataFrame
df = pd.concat(dfs, ignore_index=True)

df.sort_values(['date','tic'],ignore_index=True).head()

Unnamed: 0,date,open,high,low,close,volume,tic,day
0,2020-01-01,7195.24,7255.0,7175.15,7200.85,16792.388165,BTCUSDT,0
1,2020-01-02,7200.77,7212.5,6924.74,6965.71,31951.483932,BTCUSDT,1
2,2020-01-03,6965.49,7405.0,6871.04,7344.96,68428.500451,BTCUSDT,2
3,2020-01-04,7345.0,7404.0,7272.21,7354.11,29987.974977,BTCUSDT,3
4,2020-01-05,7354.19,7495.0,7318.0,7358.75,38331.085604,BTCUSDT,4


In [4]:
TRAIN_START_DATE = '2020-01-01'
TRAIN_END_DATE = '2023-12-31'
TRADE_START_DATE = '2023-01-01'
TRADE_END_DATE = '2023-07-31'

## Part 4. Preprocess Data

TODO: The default feature engineering is based on date. I need to rewrite into timestamp based method

The dafult [data split](https://github.com/AI4Finance-Foundation/FinRL/blob/master/finrl/meta/preprocessor/preprocessors.py) is not applicable here. Need to manually redo it.

In [5]:
fe = FeatureEngineer(
                    use_technical_indicator=True,
                    tech_indicator_list = INDICATORS,
                    use_vix=True,
                    use_turbulence=True,
                    user_defined_feature = False)

processed = fe.preprocess_data(df)

Successfully added technical indicators
[*********************100%***********************]  1 of 1 completed
Shape of DataFrame:  (899, 8)
Successfully added vix
Successfully added turbulence index


In [6]:
list_ticker = processed["tic"].unique().tolist()
list_date = list(pd.date_range(processed['date'].min(),processed['date'].max()).astype(str))
combination = list(itertools.product(list_date,list_ticker))

processed_full = pd.DataFrame(combination,columns=["date","tic"]).merge(processed,on=["date","tic"],how="left")
processed_full = processed_full[processed_full['date'].isin(processed['date'])]
processed_full = processed_full.sort_values(['date','tic'])

processed_full = processed_full.fillna(0)

In [7]:
processed_full.sort_values(['date','tic'],ignore_index=True).head(10)

Unnamed: 0,date,tic,open,high,low,close,volume,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
0,2020-01-02,BTCUSDT,7200.77,7212.5,6924.74,6965.71,31951.483932,1.0,-5.275577,7415.818177,6750.741823,0.0,-66.666667,100.0,7083.28,7083.28,12.47,0.0
1,2020-01-03,BTCUSDT,6965.49,7405.0,6871.04,7344.96,68428.500451,2.0,5.038388,7553.380949,6787.632384,62.525554,48.566103,9.7842,7170.506667,7170.506667,14.02,0.0
2,2020-01-06,BTCUSDT,7357.64,7795.34,7346.76,7758.0,54635.695316,5.0,30.994099,7847.466595,6813.326738,78.616437,144.230167,47.804389,7330.396667,7330.396667,13.85,0.0
3,2020-01-07,BTCUSDT,7758.9,8207.68,7723.71,8145.28,91171.684661,6.0,59.818211,8222.855977,6670.761165,84.911918,170.739586,67.374223,7446.808571,7446.808571,13.79,0.0
4,2020-01-08,BTCUSDT,8145.92,8455.0,7870.0,8055.98,112622.64264,7.0,73.936775,8360.665473,6685.244527,79.340168,129.684605,73.697212,7522.955,7522.955,13.45,0.0
5,2020-01-09,BTCUSDT,8054.72,8055.96,7750.0,7817.76,64239.51983,8.0,70.769674,8363.588357,6747.833865,67.175888,66.925906,57.832118,7555.711111,7555.711111,12.54,0.0
6,2020-01-10,BTCUSDT,7817.74,8199.0,7672.0,8197.02,82406.777448,9.0,87.028833,8482.777731,6756.906269,73.793294,83.067127,62.228124,7619.842,7619.842,12.56,0.0
7,2020-01-13,BTCUSDT,8184.97,8196.0,8055.89,8110.34,31159.755683,12.0,97.723665,8594.488743,6869.165103,67.344654,73.558586,59.274326,7731.826923,7731.826923,12.32,0.0
8,2020-01-14,BTCUSDT,8110.34,8880.0,8105.54,8810.01,120399.126742,13.0,137.69947,8818.333521,6799.346479,75.718364,137.207864,74.38009,7808.84,7808.84,12.39,0.0
9,2020-01-15,BTCUSDT,8814.64,8916.48,8564.0,8821.41,84816.297606,14.0,166.766812,8980.744388,6771.944946,75.822861,144.724492,74.91064,7876.344667,7876.344667,12.42,0.0


## Part 5. Build A Market Environment in OpenAI Gym-style
The training process involves observing cryptocurrency price change, taking an action and reward's calculation. By interacting with the market environment, the agent will eventually derive a trading strategy that may maximize (expected) rewards.

Our market environment, based on OpenAI Gym, simulates stock markets with historical market data.

### Data Split
We split the data into training set and testing set as follows:

In [8]:
train = data_split(processed_full, TRAIN_START_DATE, TRAIN_END_DATE)
trade = data_split(processed_full, TRADE_START_DATE, TRADE_END_DATE)
train_length = len(train)
trade_length = len(trade)
print(f"Training Data length: {train_length}")
print(f"Trade Data Length: {trade_length}")
print(f"Indicators: {INDICATORS}")

Training Data length: 899
Trade Data Length: 143
Indicators: ['macd', 'boll_ub', 'boll_lb', 'rsi_30', 'cci_30', 'dx_30', 'close_30_sma', 'close_60_sma']


In [9]:
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(INDICATORS)*stock_dimension
print(f"Crypto Dimension: {stock_dimension}, State Space: {state_space}")

Crypto Dimension: 1, State Space: 11


In [10]:
buy_cost_list = sell_cost_list = [0.001] * stock_dimension
num_stock_shares = [0] * stock_dimension

env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "num_stock_shares": num_stock_shares,
    "buy_cost_pct": buy_cost_list,
    "sell_cost_pct": sell_cost_list,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4
}

e_train_gym = StockTradingEnv(df = train, **env_kwargs)

### Environment for Training

In [11]:
env_train, _ = e_train_gym.get_sb_env()
print(type(env_train))

<class 'stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv'>


## Part 6. Train DRL Agents
* The DRL algorithms are from Stable Baselines 3. Users are also encouraged to try ElegantRL and Ray RLlib.
* FinRL includes fine-tuned standard DRL algorithms, such as DQN, DDPG, Multi-Agent DDPG, PPO, SAC, A2C and TD3. We also allow users to
design their own DRL algorithms by adapting these DRL algorithms.

design their own DRL algorithms by adapting these DRL algorithms.

In [12]:
agent = DRLAgent(env = env_train)

if_using_a2c = True
if_using_ddpg = True
if_using_ppo = True
if_using_td3 = True
if_using_sac = True

### Agent 1: A2C

In [13]:
agent = DRLAgent(env = env_train)
model_a2c = agent.get_model("a2c")

if if_using_a2c:
  # set up logger
  tmp_path = RESULTS_DIR + '/a2c'
  new_logger_a2c = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_a2c.set_logger(new_logger_a2c)

{'n_steps': 5, 'ent_coef': 0.01, 'learning_rate': 0.0007}
Using cpu device
Logging to results/a2c


In [14]:
trained_a2c = agent.train_model(model=model_a2c, 
                             tb_log_name='a2c',
                             total_timesteps=50000) if if_using_a2c else None

------------------------------------
| time/                 |          |
|    fps                | 1597     |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.43    |
|    explained_variance | 0.117    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 2.07     |
|    reward             | 0.0      |
|    std                | 1.01     |
|    value_loss         | 30.9     |
------------------------------------
-------------------------------------
| time/                 |           |
|    fps                | 1751      |
|    iterations         | 200       |
|    time_elapsed       | 0         |
|    total_timesteps    | 1000      |
| train/                |           |
|    entropy_loss       | -1.43     |
|    explained_variance | 0         |
|    learning_rate      | 0.0007    |
|    n_updates          | 19

--------------------------------------
| time/                 |            |
|    fps                | 1905       |
|    iterations         | 1500       |
|    time_elapsed       | 3          |
|    total_timesteps    | 7500       |
| train/                |            |
|    entropy_loss       | -1.41      |
|    explained_variance | 1.19e-07   |
|    learning_rate      | 0.0007     |
|    n_updates          | 1499       |
|    policy_loss        | 50.3       |
|    reward             | -28.317636 |
|    std                | 0.995      |
|    value_loss         | 1.05e+03   |
--------------------------------------
------------------------------------
| time/                 |          |
|    fps                | 1909     |
|    iterations         | 1600     |
|    time_elapsed       | 4        |
|    total_timesteps    | 8000     |
| train/                |          |
|    entropy_loss       | -1.42    |
|    explained_variance | 0        |
|    learning_rate      | 0.0007   |
|    n

-------------------------------------
| time/                 |           |
|    fps                | 1949      |
|    iterations         | 2900      |
|    time_elapsed       | 7         |
|    total_timesteps    | 14500     |
| train/                |           |
|    entropy_loss       | -1.36     |
|    explained_variance | 0         |
|    learning_rate      | 0.0007    |
|    n_updates          | 2899      |
|    policy_loss        | -6.36     |
|    reward             | -1.111362 |
|    std                | 0.941     |
|    value_loss         | 51        |
-------------------------------------
--------------------------------------
| time/                 |            |
|    fps                | 1951       |
|    iterations         | 3000       |
|    time_elapsed       | 7          |
|    total_timesteps    | 15000      |
| train/                |            |
|    entropy_loss       | -1.37      |
|    explained_variance | 0          |
|    learning_rate      | 0.0007     |
| 

------------------------------------
| time/                 |          |
|    fps                | 1966     |
|    iterations         | 4300     |
|    time_elapsed       | 10       |
|    total_timesteps    | 21500    |
| train/                |          |
|    entropy_loss       | -1.38    |
|    explained_variance | 0        |
|    learning_rate      | 0.0007   |
|    n_updates          | 4299     |
|    policy_loss        | -1.23    |
|    reward             | 7.939428 |
|    std                | 0.964    |
|    value_loss         | 60.1     |
------------------------------------
--------------------------------------
| time/                 |            |
|    fps                | 1966       |
|    iterations         | 4400       |
|    time_elapsed       | 11         |
|    total_timesteps    | 22000      |
| train/                |            |
|    entropy_loss       | -1.39      |
|    explained_variance | 0          |
|    learning_rate      | 0.0007     |
|    n_updates    

------------------------------------
| time/                 |          |
|    fps                | 1970     |
|    iterations         | 5700     |
|    time_elapsed       | 14       |
|    total_timesteps    | 28500    |
| train/                |          |
|    entropy_loss       | -1.36    |
|    explained_variance | 5.19e-06 |
|    learning_rate      | 0.0007   |
|    n_updates          | 5699     |
|    policy_loss        | -10.3    |
|    reward             | 5.480388 |
|    std                | 0.943    |
|    value_loss         | 86.6     |
------------------------------------
------------------------------------
| time/                 |          |
|    fps                | 1971     |
|    iterations         | 5800     |
|    time_elapsed       | 14       |
|    total_timesteps    | 29000    |
| train/                |          |
|    entropy_loss       | -1.37    |
|    explained_variance | 0        |
|    learning_rate      | 0.0007   |
|    n_updates          | 5799     |
|

day: 898, episode: 40
begin_total_asset: 1000000.00
end_total_asset: 4134581.67
total_reward: 3134581.67
total_cost: 997.71
total_trades: 898
Sharpe: 0.930
------------------------------------
| time/                 |          |
|    fps                | 1975     |
|    iterations         | 7100     |
|    time_elapsed       | 17       |
|    total_timesteps    | 35500    |
| train/                |          |
|    entropy_loss       | -1.33    |
|    explained_variance | 0        |
|    learning_rate      | 0.0007   |
|    n_updates          | 7099     |
|    policy_loss        | -25      |
|    reward             | 7.018134 |
|    std                | 0.918    |
|    value_loss         | 2.45e+03 |
------------------------------------
------------------------------------
| time/                 |          |
|    fps                | 1975     |
|    iterations         | 7200     |
|    time_elapsed       | 18       |
|    total_timesteps    | 36000    |
| train/                |     

-------------------------------------
| time/                 |           |
|    fps                | 1969      |
|    iterations         | 8500      |
|    time_elapsed       | 21        |
|    total_timesteps    | 42500     |
| train/                |           |
|    entropy_loss       | -1.33     |
|    explained_variance | -1.19e-07 |
|    learning_rate      | 0.0007    |
|    n_updates          | 8499      |
|    policy_loss        | 54.1      |
|    reward             | -8.150223 |
|    std                | 0.911     |
|    value_loss         | 1.34e+03  |
-------------------------------------
-------------------------------------
| time/                 |           |
|    fps                | 1969      |
|    iterations         | 8600      |
|    time_elapsed       | 21        |
|    total_timesteps    | 43000     |
| train/                |           |
|    entropy_loss       | -1.32     |
|    explained_variance | -1.19e-07 |
|    learning_rate      | 0.0007    |
|    n_updat

------------------------------------
| time/                 |          |
|    fps                | 1964     |
|    iterations         | 9900     |
|    time_elapsed       | 25       |
|    total_timesteps    | 49500    |
| train/                |          |
|    entropy_loss       | -1.35    |
|    explained_variance | 1.47e-05 |
|    learning_rate      | 0.0007   |
|    n_updates          | 9899     |
|    policy_loss        | -23.6    |
|    reward             | 3.651195 |
|    std                | 0.931    |
|    value_loss         | 1.03e+03 |
------------------------------------
-------------------------------------
| time/                 |           |
|    fps                | 1964      |
|    iterations         | 10000     |
|    time_elapsed       | 25        |
|    total_timesteps    | 50000     |
| train/                |           |
|    entropy_loss       | -1.35     |
|    explained_variance | 0         |
|    learning_rate      | 0.0007    |
|    n_updates          | 99

In [15]:
trained_a2c.save(TRAINED_MODEL_DIR + "/agent_a2c") if if_using_a2c else None

### Agent 2: DDPG

In [16]:
agent = DRLAgent(env = env_train)
model_ddpg = agent.get_model("ddpg")

if if_using_ddpg:
  # set up logger
  tmp_path = RESULTS_DIR + '/ddpg'
  new_logger_ddpg = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_ddpg.set_logger(new_logger_ddpg)

{'batch_size': 128, 'buffer_size': 50000, 'learning_rate': 0.001}
Using cpu device
Logging to results/ddpg


In [17]:
trained_ddpg = agent.train_model(model=model_ddpg, 
                             tb_log_name='ddpg',
                             total_timesteps=50000) if if_using_ddpg else None

day: 898, episode: 60
begin_total_asset: 1000000.00
end_total_asset: 1000000.00
total_reward: 0.00
total_cost: 0.00
total_trades: 0
---------------------------------
| time/              |          |
|    episodes        | 4        |
|    fps             | 341      |
|    time_elapsed    | 10       |
|    total_timesteps | 3596     |
| train/             |          |
|    actor_loss      | 1.6e+03  |
|    critic_loss     | 7.52e+04 |
|    learning_rate   | 0.001    |
|    n_updates       | 2697     |
|    reward          | 0.0      |
---------------------------------
---------------------------------
| time/              |          |
|    episodes        | 8        |
|    fps             | 301      |
|    time_elapsed    | 23       |
|    total_timesteps | 7192     |
| train/             |          |
|    actor_loss      | 1.31e+03 |
|    critic_loss     | 2.52e+04 |
|    learning_rate   | 0.001    |
|    n_updates       | 6293     |
|    reward          | 0.0      |
------------------

In [18]:
trained_ddpg.save(TRAINED_MODEL_DIR + "/agent_ddpg") if if_using_ddpg else None

### Agent 3: PPO

In [19]:
agent = DRLAgent(env = env_train)
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo",model_kwargs = PPO_PARAMS)

if if_using_ppo:
  # set up logger
  tmp_path = RESULTS_DIR + '/ppo'
  new_logger_ppo = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_ppo.set_logger(new_logger_ppo)

{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 128}
Using cpu device
Logging to results/ppo


In [20]:
trained_ppo = agent.train_model(model=model_ppo, 
                             tb_log_name='ppo',
                             total_timesteps=50000) if if_using_ppo else None

----------------------------------
| time/              |           |
|    fps             | 3173      |
|    iterations      | 1         |
|    time_elapsed    | 0         |
|    total_timesteps | 2048      |
| train/             |           |
|    reward          | 2.2261627 |
----------------------------------
-------------------------------------------
| time/                   |               |
|    fps                  | 2853          |
|    iterations           | 2             |
|    time_elapsed         | 1             |
|    total_timesteps      | 4096          |
| train/                  |               |
|    approx_kl            | 0.00039865985 |
|    clip_fraction        | 0.00122       |
|    clip_range           | 0.2           |
|    entropy_loss         | -1.42         |
|    explained_variance   | -0.00108      |
|    learning_rate        | 0.00025       |
|    loss                 | 139           |
|    n_updates            | 10            |
|    policy_gradient_loss

-----------------------------------------
| time/                   |             |
|    fps                  | 2611        |
|    iterations           | 11          |
|    time_elapsed         | 8           |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.003525573 |
|    clip_fraction        | 0.0177      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.44       |
|    explained_variance   | 0.00242     |
|    learning_rate        | 0.00025     |
|    loss                 | 73.4        |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.00132    |
|    reward               | 0.0         |
|    std                  | 1.03        |
|    value_loss           | 139         |
-----------------------------------------
day: 898, episode: 140
begin_total_asset: 1000000.00
end_total_asset: 857651.64
total_reward: -142348.36
total_cost: 417830.09
total_trades: 776
Sharpe: 0.156
-

day: 898, episode: 160
begin_total_asset: 1000000.00
end_total_asset: 2731651.60
total_reward: 1731651.60
total_cost: 287916.52
total_trades: 838
Sharpe: 0.783
-----------------------------------------
| time/                   |             |
|    fps                  | 2608        |
|    iterations           | 21          |
|    time_elapsed         | 16          |
|    total_timesteps      | 43008       |
| train/                  |             |
|    approx_kl            | 0.004268953 |
|    clip_fraction        | 0.0455      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.44       |
|    explained_variance   | 0.000203    |
|    learning_rate        | 0.00025     |
|    loss                 | 738         |
|    n_updates            | 200         |
|    policy_gradient_loss | -0.00384    |
|    reward               | -0.26156336 |
|    std                  | 1.02        |
|    value_loss           | 1.38e+03    |
-----------------------------------------


In [21]:
trained_ppo.save(TRAINED_MODEL_DIR + "/agent_ppo") if if_using_ppo else None

### Agent 4: TD3

In [22]:
agent = DRLAgent(env = env_train)
TD3_PARAMS = {"batch_size": 100, 
              "buffer_size": 1000000, 
              "learning_rate": 0.001}

model_td3 = agent.get_model("td3",model_kwargs = TD3_PARAMS)

if if_using_td3:
  # set up logger
  tmp_path = RESULTS_DIR + '/td3'
  new_logger_td3 = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_td3.set_logger(new_logger_td3)

{'batch_size': 100, 'buffer_size': 1000000, 'learning_rate': 0.001}
Using cpu device
Logging to results/td3


In [23]:
trained_td3 = agent.train_model(model=model_td3, 
                             tb_log_name='td3',
                             total_timesteps=50000) if if_using_td3 else None

---------------------------------
| time/              |          |
|    episodes        | 4        |
|    fps             | 387      |
|    time_elapsed    | 9        |
|    total_timesteps | 3596     |
| train/             |          |
|    actor_loss      | 3.22e+03 |
|    critic_loss     | 1.23e+05 |
|    learning_rate   | 0.001    |
|    n_updates       | 2697     |
|    reward          | 1.288176 |
---------------------------------
---------------------------------
| time/              |          |
|    episodes        | 8        |
|    fps             | 333      |
|    time_elapsed    | 21       |
|    total_timesteps | 7192     |
| train/             |          |
|    actor_loss      | 4.36e+03 |
|    critic_loss     | 1.15e+05 |
|    learning_rate   | 0.001    |
|    n_updates       | 6293     |
|    reward          | 1.288176 |
---------------------------------
day: 898, episode: 180
begin_total_asset: 1000000.00
end_total_asset: 4134581.67
total_reward: 3134581.67
total_cost

In [24]:
trained_td3.save(TRAINED_MODEL_DIR + "/agent_td3") if if_using_td3 else None

### Agent 5: SAC

In [25]:
agent = DRLAgent(env = env_train)
SAC_PARAMS = {
    "batch_size": 128,
    "buffer_size": 100000,
    "learning_rate": 0.0001,
    "learning_starts": 100,
    "ent_coef": "auto_0.1",
}

model_sac = agent.get_model("sac",model_kwargs = SAC_PARAMS)

if if_using_sac:
  # set up logger
  tmp_path = RESULTS_DIR + '/sac'
  new_logger_sac = configure(tmp_path, ["stdout", "csv", "tensorboard"])
  # Set new logger
  model_sac.set_logger(new_logger_sac)

{'batch_size': 128, 'buffer_size': 100000, 'learning_rate': 0.0001, 'learning_starts': 100, 'ent_coef': 'auto_0.1'}
Using cpu device
Logging to results/sac


In [26]:
trained_sac = agent.train_model(model=model_sac, 
                             tb_log_name='sac',
                             total_timesteps=50000) if if_using_sac else None

day: 898, episode: 230
begin_total_asset: 1000000.00
end_total_asset: 1000000.00
total_reward: 0.00
total_cost: 0.00
total_trades: 0
---------------------------------
| time/              |          |
|    episodes        | 4        |
|    fps             | 201      |
|    time_elapsed    | 17       |
|    total_timesteps | 3596     |
| train/             |          |
|    actor_loss      | 7.89e+03 |
|    critic_loss     | 5.75e+04 |
|    ent_coef        | 0.131    |
|    ent_coef_loss   | 19       |
|    learning_rate   | 0.0001   |
|    n_updates       | 3495     |
|    reward          | 0.0      |
---------------------------------
---------------------------------
| time/              |          |
|    episodes        | 8        |
|    fps             | 197      |
|    time_elapsed    | 36       |
|    total_timesteps | 7192     |
| train/             |          |
|    actor_loss      | 6.14e+03 |
|    critic_loss     | 2.08e+05 |
|    ent_coef        | 0.187    |
|    ent_coef_los

In [27]:
trained_sac.save(TRAINED_MODEL_DIR + "/agent_sac") if if_using_sac else None

### In-sample Performance
Assume that the initial capital is $1,000,000.

#### Set turbulence threshold
Set the turbulence threshold to be greater than the maximum of insample turbulence data. If current turbulence index is greater than the threshold, then we assume that the current market is volatile

### Trading (Out-of-sample Performance)
We update periodically in order to take full advantage of the data, e.g., retrain quarterly, monthly or weekly. We also tune the parameters along the way, in this notebook we use the in-sample data from 2009-01 to 2020-07 to tune the parameters once, so there is some alpha decay here as the length of trade date extends.

Numerous hyperparameters – e.g. the learning rate, the total number of samples to train on – influence the learning process and are usually determined by testing some variations.

In [28]:
e_trade_gym = StockTradingEnv(df = trade, turbulence_threshold = 70,risk_indicator_col='vix', **env_kwargs)
# env_trade, obs_trade = e_trade_gym.get_sb_env()

In [29]:
from stable_baselines3 import A2C, DDPG, PPO, SAC, TD3

trained_a2c = A2C.load("trained_models/agent_a2c") if if_using_a2c else None
trained_ddpg = DDPG.load("trained_models/agent_ddpg") if if_using_ddpg else None
trained_ppo = PPO.load("trained_models/agent_ppo") if if_using_ppo else None
trained_td3 = TD3.load("trained_models/agent_td3") if if_using_td3 else None
trained_sac = SAC.load("trained_models/agent_sac") if if_using_sac else None

In [32]:
trained_model = trained_a2c
df_account_value_a2c, df_actions_a2c = DRLAgent.DRL_prediction(
    model=trained_model, 
    environment = e_trade_gym)

hit end!


In [33]:
trained_model = trained_ddpg
df_account_value_ddpg, df_actions_ddpg = DRLAgent.DRL_prediction(
    model=trained_model, 
    environment = e_trade_gym)

hit end!


In [34]:
trained_model = trained_ppo
df_account_value_ppo, df_actions_ppo = DRLAgent.DRL_prediction(
    model=trained_model, 
    environment = e_trade_gym)

hit end!


In [35]:
trained_model = trained_td3
df_account_value_td3, df_actions_td3 = DRLAgent.DRL_prediction(
    model=trained_model, 
    environment = e_trade_gym)

hit end!


In [36]:
trained_model = trained_sac
df_account_value_sac, df_actions_sac = DRLAgent.DRL_prediction(
    model=trained_model, 
    environment = e_trade_gym)

hit end!


## Part 7: Backtesting Results
Backtesting plays a key role in evaluating the performance of a trading strategy. Automated backtesting tool is preferred because it reduces the human error. We usually use the Quantopian pyfolio package to backtest our trading strategies. It is easy to use and consists of various individual plots that provide a comprehensive image of the performance of a trading strategy.

In [37]:
df_result_a2c = df_account_value_a2c.set_index(df_account_value_a2c.columns[0])
df_result_a2c.rename(columns = {'account_value':'a2c'}, inplace = True)
df_result_ddpg = df_account_value_ddpg.set_index(df_account_value_ddpg.columns[0])
df_result_ddpg.rename(columns = {'account_value':'ddpg'}, inplace = True)
df_result_td3 = df_account_value_td3.set_index(df_account_value_td3.columns[0])
df_result_td3.rename(columns = {'account_value':'td3'}, inplace = True)
df_result_ppo = df_account_value_ppo.set_index(df_account_value_ppo.columns[0])
df_result_ppo.rename(columns = {'account_value':'ppo'}, inplace = True)
df_result_sac = df_account_value_sac.set_index(df_account_value_sac.columns[0])
df_result_sac.rename(columns = {'account_value':'sac'}, inplace = True)

result = pd.DataFrame()
result = pd.merge(result, df_result_a2c, how='outer', left_index=True, right_index=True)
result = pd.merge(result, df_result_ddpg, how='outer', left_index=True, right_index=True)
result = pd.merge(result, df_result_td3, how='outer', left_index=True, right_index=True)
result = pd.merge(result, df_result_ppo, how='outer', left_index=True, right_index=True)
result = pd.merge(result, df_result_sac, how='outer', left_index=True, right_index=True)
print(result.head())
# result.columns = ['a2c', 'ddpg', 'td3', 'ppo', 'sac', 'mean var', 'dji']

# print("result: ", result)
result.to_csv(RESULTS_DIR + "/result.csv")

                     a2c       ddpg           td3           ppo        sac
date                                                                      
2023-01-03  1.000000e+06  1000000.0  1.000000e+06  1.000000e+06  1000000.0
2023-01-04  1.009352e+06  1000000.0  1.009352e+06  1.004914e+06  1000000.0
2023-01-05  1.008260e+06  1000000.0  1.008260e+06  1.003350e+06  1000000.0
2023-01-06  1.015269e+06  1000000.0  1.015269e+06  1.010359e+06  1000000.0
2023-01-09  1.028698e+06  1000000.0  1.028698e+06  1.023788e+06  1000000.0


In [None]:
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,5)
plt.figure();
result.plot();