# Construct a custom Environment for Financial Trading

Some examples on the market
* [custom env example](https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/5_custom_gym_env.ipynb#scrollTo=RqxatIwPOXe_)
* [StockTradingEnv by Adam King](https://github.com/notadamking/Stock-Trading-Environment)
* [FinRL](https://github.com/AI4Finance-Foundation/FinRL)

Target is to construct a custom Env for pair trading

This env gives the RL learner freedom to operate whatever it wants. Even long n short simultaneously.

In [1]:
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from datetime import date
from sklearn.model_selection import train_test_split

from utils.read2df import read2df

Define data parameters

In [2]:
symbols = ['BTCUSDT', 'ETHUSDT', 'LTCUSDT', 'XMRUSDT', 'BNBUSDT', 'ADAUSDT', 'DOGEUSDT', 'SOLUSDT', 'TRXUSDT']
start_date = '2018-01-01'

# freqs = {'1h':60, '2h':120, '4h':240, '6h':360, '8h':480, '12h':720, '1d':1440}
freqs = {'1m':1, '3m':3, '5m':5, '15m':15, '30m':30}

Download data from `binance-public-data`

In [3]:
%%capture
if symbols is None:
    !python binance-public-data/python/download-kline.py -i {" ".join(list(freqs.keys()))} -startDate {start_date} -t spot -skip-daily 1
else:
    !python binance-public-data/python/download-kline.py -s {" ".join(symbols)} -i {" ".join(list(freqs.keys()))} -startDate {start_date} -t spot -skip-daily 1

In [4]:
# dfs = read2df(symbols, freqs)
dfs = read2df(symbols, freqs)

df0 = dfs[0][dfs[0]['tic']=='BTCUSDT'].reset_index(drop=True)
df1 = dfs[0][dfs[0]['tic']=='ETHUSDT'].reset_index(drop=True)

Set data before `trade_data` as training data, after `trade_data` is trade_data

In [5]:
trade_date = '2023-01-01'

train0 = df0[df0['datetime'] < trade_date]
train1 = df1[df1['datetime'] < trade_date]

trade0 = df0[df0['datetime'] >= trade_date]
trade1 = df1[df1['datetime'] >= trade_date]

In [6]:
# Don't use custom observation & action spaces
# See the warning on https://gymnasium.farama.org/api/spaces/

'''
class PairTradingActionSpace(gym.Space):
  def __init__(self, low=-1.0, high=1.0, shape=(2, ), dtype=np.float32):
    super().__init__(shape, dtype)
    self.low = low
    self.high = high

  def sample(self):
    action = np.random.uniform(self.low, self.high, self.shape)
    # Normalize the action so that the summation of action[0] and action[1] is within -1 and 1.
    action = action / np.linalg.norm(action)
    return action

  def contains(self, x):
    return np.all(self.low <= x) and np.all(x <= self.high) and np.linalg.norm(x) <= 1.0
'''

'\nclass PairTradingActionSpace(gym.Space):\n  def __init__(self, low=-1.0, high=1.0, shape=(2, ), dtype=np.float32):\n    super().__init__(shape, dtype)\n    self.low = low\n    self.high = high\n\n  def sample(self):\n    action = np.random.uniform(self.low, self.high, self.shape)\n    # Normalize the action so that the summation of action[0] and action[1] is within -1 and 1.\n    action = action / np.linalg.norm(action)\n    return action\n\n  def contains(self, x):\n    return np.all(self.low <= x) and np.all(x <= self.high) and np.linalg.norm(x) <= 1.0\n'

# Define the custom Environment

The RL learner can do whatever they want. 

We want to see if it can learn to be market-neutral itself.

In [14]:
# The lookback period for the observation space
PERIOD = 1440
CASH = 10000

class PairTradingEnv(gym.Env):
    metadata = {'render.modes': ['console']}

    # for pair trading, we need to feed in two OHLCV dataframes
    def __init__(self, df0, df1, tc=0.001):
        super().__init__()

        if not df0['time'].equals(df1['time']):
            raise ValueError("Two dataframe must have same time index")

        self.tic0 = df0['tic'].iloc[0]
        self.tic1 = df1['tic'].iloc[0]

        # transaction cost
        self.tc = tc

        # get two datasets
        self.df0 = df0[['time', 'open', 'high', 'low', 'close', 'volume']]
        self.df1 = df1[['time', 'open', 'high', 'low', 'close', 'volume']]

        self.reward_range = (-np.inf, np.inf)

        # -1 means short 100%, 1 means long 100%, 0 means do nothing
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2, ), dtype=np.float32)

        # The data requires to be at least [time, open, high, low, close, volume]
        # Let's assume that we feed in previous 30 period data into the observation_space
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(2*PERIOD*6,), dtype=np.float64)

        # if the length is 35, then the index shall be 0~34
        self.max_steps = len(df0)-1

    def _next_observation(self):
        # The current step is always higher than the PERIOD as defined in the 

        obs_df0 = self.df0.iloc[self.current_step-PERIOD: self.current_step]
        obs_df1 = self.df1.iloc[self.current_step-PERIOD: self.current_step]

        obs = np.array([obs_df0, obs_df1]).flatten()

        return obs

    def _take_action(self, action):
        self.action = action

        current_price0 = self.df0['close'].iloc[self.current_step]
        current_price1 = self.df1['close'].iloc[self.current_step]

        # evaluate purchasing power 
        max_amount0 = self.net_worth/current_price0
        max_amount1 = self.net_worth/current_price1

        curr_holding0 = self.holding0/max_amount0
        curr_holding1 = self.holding1/max_amount1

        # clip the action to the summation of [-1, 1]
        if sum(self.action) > 1:
            action0 = self.action[0]/(sum(self.action)+self.tc)
            action1 = self.action[1]/(sum(self.action)+self.tc)
            self.action = [action0, action1]
        elif sum(self.action) < -1:
            action0 = self.action[0]/(sum(self.action)-self.tc)
            action1 = self.action[1]/(sum(self.action)-self.tc)

        # if curr_h is -70%, action is -40%, then we need to clip the action to -30%
        if curr_holding0 + self.action[0] > 1:
            self.action[0] = 1 - curr_holding0
        elif curr_holding0 + self.action[0] < -1:
            self.action[0] = -1 - curr_holding0

        if curr_holding1 + self.action[1] > 1:
            self.action[1] = 1 - curr_holding1
        elif curr_holding0 + self.action[0] < -1:
            self.action[1] = -1 - curr_holding1

        self.holding0 += self.action[0]*max_amount0
        self.holding1 += self.action[1]*max_amount1
        self.cash -= self.cash*sum(action)*(1+self.tc)

        # We record the net_worth from previous period to prev_net_worth
        self.prev_net_worth = self.net_worth
        self.net_worth = self.cash + self.holding0*current_price0 + self.holding1*current_price1

    def step(self, action):
        self._take_action(action)
        self.current_step += 1

        observation = self._next_observation()

        # TODO: what if I heavily punish loss?
        reward = self.net_worth - self.prev_net_worth
        
        terminated = bool(self.current_step >= self.max_steps)
        truncated = bool(self.net_worth <= 0)
        info = {}

        return observation, reward, terminated, truncated, info

    def reset(self, seed=None):
        np.random.seed(seed)
        
        self.cash = CASH
        self.net_worth = CASH
        self.max_net_worth = CASH
        self.holding0 = 0
        self.holding1 = 0

        self.current_step = np.random.randint(PERIOD, self.max_steps)

        return self._next_observation(), {}
    
    def render(self):
        profit = self.net_worth - CASH

        print("----------------------------------------")
        print(f"Current profit is {profit}, cash is {self.cash}, net worth is {self.net_worth}")
        print(f"Actions for this step is {self.tic0} for {self.action[0]} and {self.tic1} for {self.action[1]}")
        print(f"Current holding is {self.holding0} of {self.tic0} and {self.holding1} of {self.tic1}")

## Check with baselin3 `env_checker`

Check if the env meets the requirements of `stable_baseline3`

In [8]:
from stable_baselines3.common.env_checker import check_env

env = PairTradingEnv(train0, train1)
check_env(env)

## Do a test run with random generated actions

In [9]:
import random

env = PairTradingEnv(train0, train1)

obs, _ = env.reset()

print(f"observation_space: {env.observation_space}")
print(f"action_space: {env.action_space}")
print(f"action_space.sample: {env.action_space.sample()}")

n_steps = 100

for step in range(n_steps):
    print(f"Step {step + 1}")
    obs, reward, terminated, truncated, info = env.step(action=[random.uniform(-1, 1) for _ in range(2)])
    done = terminated or truncated
    env.render()
    if done:
        print("Test Finished!")
        break

observation_space: Box(0.0, inf, (12000,), float64)
action_space: Box(-1.0, 1.0, (2,), float32)
action_space.sample: [-0.33985785  0.5390888 ]
Step 1
Current profit is 10.120015260687069, cash is 20130.135275947374, net worth is 10010.120015260687
Actions for this step is BTCUSDT for -0.8602067228949191 and ETHUSDT for -0.15179480317374971
Current holding is -0.14137480298895078 of BTCUSDT and -0.37027638290950043 of ETHUSDT
Step 2
Current profit is 5460.813735583637, cash is 30965.816456415538, net worth is 15460.813735583637
Actions for this step is BTCUSDT for -0.14053751362783418 and ETHUSDT for -0.39720633040277065
Current holding is -0.16449211598018768 of BTCUSDT and -1.3396271650424723 of ETHUSDT
Step 3
Current profit is -1231.8525107725673, cash is 17619.996029542835, net worth is 8768.147489227433
Actions for this step is BTCUSDT for 0.0667777112417074 and ETHUSDT for 0.3637773157087132
Current holding is -0.14752822364214097 of BTCUSDT and 0.030911522974716554 of ETHUSDT
Ste

## PPO model from stable_baselines3

Train with training data

In [10]:
from stable_baselines3 import PPO

env = PairTradingEnv(train0, train1)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)
model.save("ppo_pairtrading")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 5.86     |
|    ep_rew_mean     | -5.5e+04 |
| time/              |          |
|    fps             | 1345     |
|    iterations      | 1        |
|    time_elapsed    | 1        |
|    total_timesteps | 2048     |
---------------------------------
-------------------------------------------
| rollout/                |               |
|    ep_len_mean          | 6.33          |
|    ep_rew_mean          | -1.57e+05     |
| time/                   |               |
|    fps                  | 693           |
|    iterations           | 2             |
|    time_elapsed         | 5             |
|    total_timesteps      | 4096          |
| train/                  |               |
|    approx_kl            | 0.00019665368 |
|    clip_fraction        | 0             |
|    clip_range           | 0.2       

Explanation from AI

---

The text you've provided appears to be a summary of metrics and statistics related to some kind of training or optimization process. These values are often generated during the training of machine learning models, particularly reinforcement learning models like PPO (Proximal Policy Optimization) or other similar algorithms. Let's break down what each section means:

1. **Rollout**:
   - `ep_len_mean`: This is the mean (average) length of episodes. In a reinforcement learning context, episodes are sequences of actions taken by an agent in an environment until a termination condition is met.

   - `ep_rew_mean`: This is the mean (average) reward obtained in episodes. It seems to be very negative, which might indicate that the agent is not performing well or that the environment is very challenging.

2. **Time**:
   - `fps`: Frames per second. This tells you how fast the training is progressing in terms of processing frames or steps in the environment per second.

   - `iterations`: The number of training iterations that have been completed.

   - `time_elapsed`: The total time (in some unit, possibly seconds) that has elapsed during the training.

   - `total_timesteps`: The total number of timesteps or steps the agent has taken in the environment.

3. **Train**:
   - `approx_kl`: A measure of how much the current policy differs from the previous policy. It's used to control the rate of policy updates.

   - `clip_fraction`: The fraction of actions that were clipped during training. Clipping means that the policy update is bounded within a certain range to ensure stability.

   - `clip_range`: The range within which the policy update is clipped.

   - `entropy_loss`: A measure of the entropy of the policy distribution. It's often used to encourage exploration.

   - `explained_variance`: A measure of how well the value function (used to estimate expected rewards) explains the actual rewards. A value of 0 means no explanation, and 1 means perfect explanation.

   - `learning_rate`: The learning rate used in the optimization process.

   - `loss`: The overall loss of the model. In this case, it's a very large number, which might indicate a problem with the training process.

   - `n_updates`: The number of updates performed on the policy.

   - `policy_gradient_loss`: A loss term related to the policy gradient method used in reinforcement learning.

   - `std`: The standard deviation of the policy distribution.

   - `value_loss`: A loss term related to the value function, which estimates expected rewards.

These values are crucial for monitoring and understanding the progress of a reinforcement learning training process. To assess whether the training is successful or not, you would typically look at metrics like `ep_rew_mean` (average reward), `approx_kl` (policy stability), and `explained_variance` (how well the model explains rewards). The large values for `loss` and `value_loss` could be indicative of problems in the training process, but further analysis would be needed to diagnose the issue.

## Use the model on Trade data

In [11]:
del model
model = PPO.load("ppo_pairtrading")

In [16]:
env = PairTradingEnv(trade0, trade1)

env.reset()
while True:
    action, _states = model.predict(obs)
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    env.render()
    if done:
        print("Test Finished!")
        break

----------------------------------------
Current profit is 2.0548489689826965, cash is 12056.90381795168, net worth is 10002.054848968983
Actions for this step is BTCUSDT for -0.5627274513244629 and ETHUSDT for 0.35724255442619324
Current holding is -0.20555120112989672 of BTCUSDT and 2.0794700336225924 of ETHUSDT
----------------------------------------
Current profit is 2210.7607164796937, cash is 24958.493052998507, net worth is 12210.760716479694
Actions for this step is BTCUSDT for -0.06898924708366394 and ETHUSDT for -1.0
Current holding is -0.23075251800940827 of BTCUSDT and -3.742447806289505 of ETHUSDT
----------------------------------------
Current profit is 14430.445797355507, cash is 48851.96714034401, net worth is 24430.445797355507
Actions for this step is BTCUSDT for -0.48280754685401917 and ETHUSDT for -0.473564475774765
Current holding is -0.4461637350121562 of BTCUSDT and -7.109033643549234 of ETHUSDT
----------------------------------------
Current profit is 2199.73

Seems we went bankrupcy only after a few steps.

That means the learning does not work well if we only feed in historical a rolling frame