In [1]:
import os
import pandas as pd
import gymnasium as gym

from finrl.main import check_and_make_directories
from finrl.main import INDICATORS, TRAINED_MODEL_DIR, RESULTS_DIR

check_and_make_directories([TRAINED_MODEL_DIR])



### Why The Offline Approach

1. **Offline Learning**: Decision Transformers are designed for offline RL. They learn from existing trajectories rather than interacting with the environment during training.
2. **Expert Demonstrations**: The PPO model serves as an "expert" that provides high-quality trading trajectories. The DT learns to mimic this expert behavior.
3. **Conditional Generation**: Unlike PPO, which learns a policy directly, the DT learns to generate actions conditioned on:

* Current states
* Desired returns-to-go (future performance targets)
* Timesteps

4. **Flexibility**: Once trained, the DT can generate actions for different return targets without retraining, while PPO is fixed to its learned policy.
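The "desired returns-to-go" signal mentioned above can be illustrated with a small sketch. It assumes the usual definition, where the return-to-go at step t is the (optionally discounted) sum of rewards from t to the end of the trajectory. The helper name `returns_to_go` and the toy reward list are illustrative and not part of this notebook.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return R_t = sum over k >= t of gamma**(k - t) * r_k, for every step t."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed for t + 1.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Toy reward sequence (hypothetical values, not taken from the PPO rollout).
rewards = [1.0, 0.0, 2.0, 3.0]
rtg = returns_to_go(rewards)
print(rtg)  # [6.0, 5.0, 5.0, 3.0]
```

At inference time, the first entry would be replaced by a target return chosen by the user, which is what lets one trained DT be steered toward different performance targets.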

In [2]:
train = pd.read_csv('data/train.csv')

train = train.set_index(train.columns[0])
train.index.names = ['']

In [3]:
train.head()

Unnamed: 0,date,close,high,low,open,volume,tic,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,vix,turbulence
,,,,,,,,,,,,,,,,,,
0.0,2009-01-02,2.724327,2.733033,2.556515,2.578129,746015200.0,AAPL,4.0,0.0,2.944414,2.619214,100.0,66.666667,100.0,2.724327,2.724327,39.189999,0.0
0.0,2009-01-02,40.463203,40.524938,39.612645,40.188829,6547900.0,AMGN,4.0,0.0,2.944414,2.619214,100.0,66.666667,100.0,40.463203,40.463203,39.189999,0.0
0.0,2009-01-02,14.854059,15.000064,14.139404,14.27004,10955700.0,AXP,4.0,0.0,2.944414,2.619214,100.0,66.666667,100.0,14.854059,14.854059,39.189999,0.0
0.0,2009-01-02,33.941093,34.173619,32.088396,32.103398,7010200.0,BA,4.0,0.0,2.944414,2.619214,100.0,66.666667,100.0,33.941093,33.941093,39.189999,0.0
0.0,2009-01-02,30.233912,30.279027,28.815991,28.944894,7117200.0,CAT,4.0,0.0,2.944414,2.619214,100.0,66.666667,100.0,30.233912,30.233912,39.189999,0.0


In [4]:
train.tic.unique(), INDICATORS

(array(['AAPL', 'AMGN', 'AXP', 'BA', 'CAT', 'CRM', 'CSCO', 'CVX', 'DIS',
        'GS', 'HD', 'HON', 'IBM', 'INTC', 'JNJ', 'JPM', 'KO', 'MCD', 'MMM',
        'MRK', 'MSFT', 'NKE', 'PG', 'TRV', 'UNH', 'V', 'VZ', 'WBA', 'WMT'],
       dtype=object),
 ['macd',
  'boll_ub',
  'boll_lb',
  'rsi_30',
  'cci_30',
  'dx_30',
  'close_30_sma',
  'close_60_sma'])

In [5]:
stock_dimension = len(train.tic.unique())
state_space = 1 + 2*stock_dimension + len(INDICATORS)*stock_dimension
print(f'Stock Dimension: {stock_dimension}', f'State Space: {state_space}')

Stock Dimension: 29 State Space: 291


### Stock Universe

The model trades 29 stocks from the Dow Jones Industrial Average.

* **Stocks:** AAPL, AMGN, AXP, BA, CAT, CRM, CSCO, CVX, DIS, GS, HD, HON, IBM, INTC, JNJ, JPM, KO, MCD, MMM, MRK, MSFT, NKE, PG, TRV, UNH, V, VZ, WBA, WMT

### Technical Indicators

The environment uses 8 technical indicators for each stock:

1. **MACD** - Moving Average Convergence Divergence
2. **Bollinger Upper Band**
3. **Bollinger Lower Band**
4. **RSI (30-period)** - Relative Strength Index
5. **CCI (30-period)** - Commodity Channel Index
6. **DX (30-period)** - Directional Movement Index
7. **Close 30-day SMA** - Simple Moving Average
8. **Close 60-day SMA** - Simple Moving Average

### State Space Composition

The state space has 291 dimensions calculated as follows:

$$\text{State Space}=\text{Cash Balance}+2\cdot\text{Stock Dimensions}+\text{Indicators}\cdot\text{Stock Dimensions}$$
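As a worked check of the formula, the sketch below recomputes the 291 dimensions and maps each block to an index range. The ordering (cash, then prices, then holdings, then indicators) is an assumption about how the environment concatenates its state; the `layout` dictionary is purely illustrative.

```python
stock_dimension = 29   # number of tickers
n_indicators = 8       # len(INDICATORS)

cash_dims = 1
price_dims = stock_dimension                      # one price per stock
holding_dims = stock_dimension                    # one share count per stock
indicator_dims = n_indicators * stock_dimension   # 8 indicators per stock

state_space = cash_dims + price_dims + holding_dims + indicator_dims
print(state_space)  # 291

# Assumed ordering of the blocks inside the state vector.
layout = {}
start = 0
for name, width in [("cash", cash_dims), ("prices", price_dims),
                    ("holdings", holding_dims), ("indicators", indicator_dims)]:
    layout[name] = slice(start, start + width)
    start += width

print(layout["indicators"])  # slice(59, 291, None)
```

The "2 x Stock Dimensions" term in the formula is therefore the price block plus the holdings block.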

### Action Space

* 29-dimensional continuous action space
* Each action represents the number of shares to buy or sell for each stock
* Actions are bounded by `hmax` (100 shares maximum per trade)

### Trading Constraints

* **Transaction Costs:** 0.5\% for both buying and selling (training)
* **Position Limits:** Maximum 100 shares per stock per trade
* **Initial Capital:** \$1,000,000
* **Reward Scaling:** $10^{-4}$ (`reward_scaling=1e-4`)
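To make these constraints concrete, here is a minimal sketch of the arithmetic for a single trade. It assumes the fee is charged proportionally to trade value on both sides and that the reward is the change in total asset value multiplied by the `reward_scaling` value of `1e-4` passed to the environment. The function names are hypothetical and do not come from FinRL.

```python
COST_PCT = 0.005        # 0.5% fee on both buys and sells
REWARD_SCALING = 1e-4   # value passed as reward_scaling


def cash_paid_for_buy(price, shares, cost_pct=COST_PCT):
    """Cash leaving the account when buying: trade value plus the fee."""
    return price * shares * (1 + cost_pct)


def cash_received_for_sell(price, shares, cost_pct=COST_PCT):
    """Cash entering the account when selling: trade value minus the fee."""
    return price * shares * (1 - cost_pct)


def scaled_reward(asset_change, scaling=REWARD_SCALING):
    """Assumed reward: change in total assets multiplied by the scaling factor."""
    return asset_change * scaling


# Buy and immediately sell the maximum of 100 shares at an unchanged price of 50.
paid = cash_paid_for_buy(price=50.0, shares=100)
received = cash_received_for_sell(price=50.0, shares=100)
round_trip_cost = paid - received
print(round(paid, 2), round(received, 2), round(round_trip_cost, 2))
```

Under these assumptions, a full round trip of 100 shares at a price of 50 costs about 50 dollars in fees, which is why frequent trading is penalized even when prices do not move.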

### Data Structure
* **Training Period**: Historical data with 3,397 trading days (day indices 0 through 3,396)
* **Total Data Points**: 98,513 observations ($29\times 3{,}397$ stock-day rows)
* **Features**: OHLCV data + technical indicators + VIX + turbulence index

This environment simulates realistic stock trading with transaction costs and position limits, and uses a broad set of technical indicators to inform trading decisions. The model learns to optimize portfolio allocation across the 29 stocks to maximize returns while managing risk.

In [6]:
from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv

buy_cost_list = sell_cost_list = [0.005] * stock_dimension
num_stock_shares = [0] * stock_dimension

env_kwargs = {
    'hmax':100,
    'initial_amount': 1000000,
    'num_stock_shares': num_stock_shares,
    'buy_cost_pct': buy_cost_list,
    'sell_cost_pct': sell_cost_list,
    'state_space': state_space,
    'stock_dim': stock_dimension,
    'tech_indicator_list': INDICATORS,
    'action_space': stock_dimension,
    'reward_scaling': 1e-4
}

e_train_gym = StockTradingEnv(df=train, **env_kwargs)
env_train, _ = e_train_gym.get_sb_env()

In [7]:
len(e_train_gym.df.index.unique()) - 1

3396

In [8]:
e_train_gym.df.tic.count()

np.int64(98513)

In [9]:
from stable_baselines3 import PPO
from finrl.agents.stablebaselines3.models import DRLAgent
from stable_baselines3.common.logger import configure

agent = DRLAgent(env = env_train)
model_ppo = agent.get_model('ppo')

tmp_path = RESULTS_DIR + '/ppo'
new_logger_ppo = configure(tmp_path, ['stdout', 'csv', 'tensorboard'])

model_ppo.set_logger(new_logger_ppo)

trained_ppo = agent.train_model(model=model_ppo,
                                tb_log_name='ppo',
                                total_timesteps=50000)

{'n_steps': 2048, 'ent_coef': 0.01, 'learning_rate': 0.00025, 'batch_size': 64}
Using cuda device
Logging to results/ppo




-----------------------------------
| time/              |            |
|    fps             | 134        |
|    iterations      | 1          |
|    time_elapsed    | 15         |
|    total_timesteps | 2048       |
| train/             |            |
|    reward          | 0.14817615 |
-----------------------------------
----------------------------------------
| time/                   |            |
|    fps                  | 128        |
|    iterations           | 2          |
|    time_elapsed         | 31         |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.02183218 |
|    clip_fraction        | 0.259      |
|    clip_range           | 0.2        |
|    entropy_loss         | -41.2      |
|    explained_variance   | -0.0117    |
|    learning_rate        | 0.00025    |
|    loss                 | 2.45       |
|    n_updates            | 10         |
|    policy_gradient_loss | -0.0288    |
|    reward         

In [10]:
trained_ppo.save(TRAINED_MODEL_DIR + '/agent_ppo')

In [11]:
model = PPO.load('trained_models/agent_ppo.zip')



Here we generate training data for the Decision Transformer (DT) model by running the pre-trained PPO (Proximal Policy Optimization) reinforcement learning model through the stock trading environment.

### Main Loop (Lines 16-37)

The loop runs through the trading environment step by step. This creates offline trajectories with:

* **States**: Market observations (291-dimensional state space)
* **Actions**: PPO's trading decisions (29-dimensional action space)
* **Rewards**: Trading performance
* **Dones**: Episode termination flags
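The way the loop groups steps into fixed-size records can be sketched independently of the environment. The helper below is a hypothetical stand-in (`chunk_trajectory` is not a FinRL or notebook function) that splits a flat rollout into 100-step dictionaries with the same four keys and drops any incomplete tail, which is also what happens to steps collected after the last multiple of 100.

```python
def chunk_trajectory(observations, actions, rewards, dones, chunk_size=100):
    """Split a flat rollout into fixed-size chunks; an incomplete tail is dropped."""
    n_full = len(rewards) // chunk_size
    chunks = []
    for c in range(n_full):
        lo, hi = c * chunk_size, (c + 1) * chunk_size
        chunks.append({
            "observations": observations[lo:hi],
            "actions": actions[lo:hi],
            "rewards": rewards[lo:hi],
            "dones": dones[lo:hi],
        })
    return chunks


# Toy rollout of 250 steps with 1-dimensional states and actions.
n_steps = 250
observations = [[float(t)] for t in range(n_steps)]
actions = [[0.0] for _ in range(n_steps)]
rewards = [1.0] * n_steps
dones = [False] * n_steps

chunks = chunk_trajectory(observations, actions, rewards, dones)
print(len(chunks), len(chunks[0]["rewards"]))  # 2 100
```

Because an episode length is generally not a multiple of 100, a chunk can straddle an episode boundary; the `dones` flags are what mark where one episode ends inside a chunk.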



In [None]:
import numpy as np

"""make a prediction and get results"""
env_train, obs = e_train_gym.get_sb_env()

ds = []
states = []
feature = {}

s, a, r, d = [], [], [], []

env_train.reset()
#max_steps = len(e_train_gym.df.index.unique()) - 1
max_steps = e_train_gym.df.tic.count() - 1

for i in range(1, max_steps, 1):

    action, _states = model.predict(obs, deterministic=True)
    s.extend(obs)
    a.extend(action)

    obs, rewards, dones, info = env_train.step(action)
    r.extend(rewards)
    d.append(dones[0])

    states.extend(obs)

    if (i % 100 == 0):
        
        feature['observations'] = s
        feature['actions'] = a
        feature['rewards'] = r
        feature['dones'] = d
        
        ds.append(feature)
        feature = {}
        s, a, r, d = [], [], [], []

day: 3396, episode: 20
begin_total_asset: 1000000.00
end_total_asset: 7416828.80
total_reward: 6416828.80
total_cost: 12279.06
total_trades: 52303
Sharpe: 0.968
day: 3396, episode: 30
begin_total_asset: 1000000.00
end_total_asset: 7416828.80
total_reward: 6416828.80
total_cost: 12279.06
total_trades: 52303
Sharpe: 0.968
day: 3396, episode: 40
begin_total_asset: 1000000.00
end_total_asset: 7416828.80
total_reward: 6416828.80
total_cost: 12279.06
total_trades: 52303
Sharpe: 0.968


### State Statistics

This calculates the mean and standard deviation of all collected states, which will be used for normalization during Decision Transformer training.

The purpose is to prepare data for imitation learning. We collect expert demonstrations from a trained RL agent (PPO) to train a Decision Transformer model. The DT will learn to replicate the PPO agent's behavior by observing the state-action-reward sequences.

The data structure `ds` contains batches of experience tuples (observations, actions, rewards, dones) that will be used to train the Decision Transformer to make trading decisions similar to those of the PPO model.
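As a rough illustration of how such statistics are typically applied, the sketch below standardizes a batch of states with a per-feature mean and standard deviation. The random data and the `normalize` helper are illustrative only. The small constant added to the standard deviation keeps features that never change from causing a division by zero.

```python
import numpy as np

# Synthetic stand-in for the collected states: 1000 steps, 4 features.
rng = np.random.default_rng(0)
toy_states = rng.normal(loc=50.0, scale=10.0, size=(1000, 4))
toy_states[:, 3] = 7.0  # a constant feature, so its standard deviation is 0

toy_mean = toy_states.mean(axis=0)
toy_std = toy_states.std(axis=0) + 1e-6  # epsilon avoids division by zero


def normalize(batch, mean, std):
    """Standardize each feature to roughly zero mean and unit variance."""
    return (batch - mean) / std


normalized = normalize(toy_states, toy_mean, toy_std)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```

The same mean and standard deviation should be reused at evaluation time, so that the DT sees inputs on the scale it was trained on.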


In [13]:
states = np.vstack(states)
state_mean, state_std = np.mean(states, axis=0), np.std(states, axis=0) + 1e-6

In [14]:
state_mean[:5], state_std[:5], state_mean.shape

(array([4.3963367e+04, 4.1790710e+01, 1.1280511e+02, 7.4285767e+01,
        1.5208690e+02], dtype=float32),
 array([1.6916192e+05, 4.3230621e+01, 5.9250042e+01, 3.7927383e+01,
        1.0245756e+02], dtype=float32),
 (291,))

In [15]:
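len_ds = len(ds)

# Dataset.from_dict needs every column to have the same length, so zero-pad the
# 291-entry statistics up to len(ds); slice them back with [:state_space] before use.
# Note: this assumes len(ds) >= state_space, otherwise np.pad raises a ValueError.
state_mean = np.pad(state_mean, (0, len_ds - state_space))
state_std = np.pad(state_std, (0, len_ds - state_space))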
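The padding above changes the length of the statistics, so whoever consumes them later has to undo it. A minimal sketch of the round trip, using a synthetic vector in place of the real `state_mean` (the values here are placeholders, not the notebook's statistics):

```python
import numpy as np

state_space = 291
len_ds = 985

# Placeholder statistics vector standing in for the real state_mean.
toy_mean = np.arange(state_space, dtype=np.float32)

# Zero-pad on the right up to len_ds, as done for state_mean and state_std.
padded = np.pad(toy_mean, (0, len_ds - state_space))

# Recover the original vector by keeping only the first state_space entries.
restored = padded[:state_space]

print(padded.shape, restored.shape)  # (985,) (291,)
```

Slicing matters most for the standard deviation: the padded tail is all zeros, so dividing by the unsliced vector would produce infinities or NaNs.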

In [16]:
state_mean, len(state_mean)

(array([4.39633672e+04, 4.17907104e+01, 1.12805107e+02, 7.42857666e+01,
        1.52086899e+02, 9.07077408e+01, 9.35620880e+01, 2.48966351e+01,
        7.07632751e+01, 8.43889084e+01, 1.63063492e+02, 1.14356506e+02,
        9.23569031e+01, 9.81931458e+01, 2.87876167e+01, 8.41212234e+01,
        6.18488350e+01, 3.15784149e+01, 1.08215553e+02, 8.87638321e+01,
        4.15294914e+01, 8.19118958e+01, 5.38253441e+01, 7.01174393e+01,
        8.37682190e+01, 1.45445206e+02, 8.84838943e+01, 2.87105122e+01,
        3.87024231e+01, 2.41100769e+01, 5.95184278e+00, 1.28901642e+02,
        2.36057373e+02, 1.63312183e+03, 3.75818146e+02, 4.07814911e+02,
        1.67866621e+04, 9.30251479e-02, 2.70149414e+02, 0.00000000e+00,
        2.53899487e+03, 8.81236649e+00, 0.00000000e+00, 3.92849541e+00,
        4.09192890e-02, 1.79700432e+01, 1.10966516e+03, 8.56394922e+03,
        4.97284470e+01, 9.89319275e+02, 6.81661865e+03, 1.82795227e+02,
        8.83150089e-04, 6.32831238e+02, 8.24273471e-03, 5.368394

In [17]:
len(ds), len(ds[0])

(985, 4)

In [18]:
feature = ds[0]
len(feature['rewards'])

100

In [19]:
input_data = {}
input_data['train'] = ds
input_data['state_mean'] = state_mean
input_data['state_std'] = state_std

In [20]:
input_data.keys()

dict_keys(['train', 'state_mean', 'state_std'])

In [21]:
from datasets import Dataset

dataset = Dataset.from_dict(input_data)

In [22]:
dataset.save_to_disk("data/dataset/")


In [23]:
from datasets import load_from_disk

dataset = load_from_disk("data/dataset/")