In this notebook, we train a Reinforcement Learning (RL) agent to allocate a portfolio made up of the individual stocks from the S&P100, with the goal of maximizing cumulative return over the training horizon. The out-of-sample performance is shown at the end.

Several important criteria are as follows:
1. This is a single-period RL system. We intend to extend this to a multi-period approach by the next submission. 
2. The training dataset is between 01-01-1990 and 01-08-2017 whereas the testing dataset is between 01-09-2017 and 01-07-2024.
3. We will use a monthly frequency instead of daily.
4. The following features are used in this notebook: closing price, high price, low price
5. In the RL setting:
    - The state is a 3D tensor with shape (f, n, m), where f is the number of features, n the number of input periods, and m the number of assets
    - The portfolio reward is the logarithmic return of the portfolio
    - The model used is the Ensemble of Identical Independent Evaluators (EIIE): https://arxiv.org/abs/1706.10059
    - The agent makes an action of selecting weights/allocations for these stocks
6. For benchmark comparison, the other notebook uses a multi-period Markowitz mean-variance portfolio and the S&P100 index itself.
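
As a rough illustration of item 5, the sketch below builds a toy state tensor of shape (f, n, m), maps raw scores to portfolio weights with a softmax, and computes the logarithmic portfolio return for one step. All values and the names `state`, `scores`, `weights`, and `price_relatives` are made up for illustration; they are not taken from FinRL or from the EIIE implementation, and transaction costs are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

f, n, m = 3, 24, 5  # features (close, high, low), input periods, assets

# Toy state: one (n, m) price window per feature; values are synthetic.
state = rng.uniform(0.5, 1.0, size=(f, n, m))

# Hypothetical raw scores, one per asset; a softmax maps them to
# non-negative weights that sum to 1.
scores = rng.normal(size=m)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Price relatives: next-period close divided by current close (synthetic).
price_relatives = rng.uniform(0.95, 1.05, size=m)

# Single-step reward: log of the weighted gross portfolio return.
log_return = np.log(weights @ price_relatives)

print(state.shape, weights.round(3), log_return)
```

With weights that sum to 1, the gross return is a weighted average of the price relatives, so the log return is positive only when that average exceeds 1.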

In [1]:
import matplotlib.pyplot as plt
import yfinance as yf
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
sns.set()

In [2]:
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

In [3]:
import torch

import numpy as np

from sklearn.preprocessing import MaxAbsScaler
from finrl.meta.env_portfolio_optimization.env_portfolio_optimization import PortfolioOptimizationEnv
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class GroupByScaler(BaseEstimator, TransformerMixin):
    """Sklearn-like scaler that scales considering groups of data.

    In the financial setting, this scaler can be used to normalize a DataFrame
    with time series of multiple tickers. The scaler will fit and transform
    data for each ticker independently.
    """

    def __init__(self, by, scaler=MaxAbsScaler, columns=None, scaler_kwargs=None):
        """Initializes GroupBy scaler.

        Args:
            by: Name of column that will be used to group.
            scaler: Scikit-learn scaler class to be used.
            columns: List of columns that will be scaled.
            scaler_kwargs: Keyword arguments for chosen scaler.
        """
        self.scalers = {}  # dictionary with scalers
        self.by = by
        self.scaler = scaler
        self.columns = columns
        self.scaler_kwargs = {} if scaler_kwargs is None else scaler_kwargs

    def fit(self, X, y=None):
        """Fits the scaler to input data.

        Args:
            X: DataFrame to fit.
            y: Not used.

        Returns:
            Fitted GroupBy scaler.
        """
        # if columns aren't specified, consider all numeric columns
        if self.columns is None:
            self.columns = X.select_dtypes(exclude=["object"]).columns
        # fit one scaler for each group
        for value in X[self.by].unique():
            X_group = X.loc[X[self.by] == value, self.columns]
            self.scalers[value] = self.scaler(**self.scaler_kwargs).fit(X_group)
        return self

    def transform(self, X, y=None):
        """Transforms unscaled data.

        Args:
            X: DataFrame to transform.
            y: Not used.

        Returns:
            Transformed DataFrame.
        """
        # apply scaler for each group
        X = X.copy()
        for value in X[self.by].unique():
            select_mask = X[self.by] == value
            X.loc[select_mask, self.columns] = self.scalers[value].transform(
                X.loc[select_mask, self.columns]
            )
        return X
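
As a usage-style illustration, the per-ticker scaling that `GroupByScaler` performs with `MaxAbsScaler` can be reproduced on a toy frame with a plain pandas `groupby`. The tickers and prices below are invented, and this is only a sketch of the idea, not a replacement for the class above.

```python
import pandas as pd

# Toy long-format frame: two hypothetical tickers on very different price scales.
toy_df = pd.DataFrame({
    "date": ["2020-01-01", "2020-02-01", "2020-03-01"] * 2,
    "tic": ["AAA"] * 3 + ["BBB"] * 3,
    "close": [10.0, 20.0, 40.0, 500.0, 250.0, 1000.0],
})

# Max-abs scaling per ticker: divide each value by the largest absolute
# value observed for that ticker, so every series peaks at 1.0.
scaled = toy_df.copy()
scaled["close"] = toy_df.groupby("tic")["close"].transform(
    lambda s: s / s.abs().max()
)

print(scaled)
```

Scaling each ticker independently keeps a high-priced stock from dominating a low-priced one purely because of its price level.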


Using the 101 Stocks from the S&P100

In [None]:
sp_100 = [
    'AAPL', 'ABBV', 'ABT', 'ACN', 'ADBE', 'AIG', 'AMD', 'AMGN', 'AMT', 'AMZN',
    'AVGO', 'AXP', 'BA', 'BAC', 'BK', 'BKNG', 'BLK', 'BMY', 'BRK-B', 'C',
    'CAT', 'CHTR', 'CL', 'CMCSA', 'COF', 'COP', 'COST', 'CRM', 'CSCO', 'CVS',
    'CVX', 'DE', 'DHR', 'DIS', 'DUK', 'EMR', 'F', 'FDX', 'GD',
    'GE', 'GILD', 'GM', 'GOOG', 'GOOGL', 'GS', 'HD', 'HON', 'IBM', 'INTC',
    'INTU', 'JNJ', 'JPM', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'MA',
    'MCD', 'MDLZ', 'MDT', 'MET', 'META', 'MMM', 'MO', 'MRK', 'MS', 'MSFT',
    'NEE', 'NFLX', 'NKE', 'NVDA', 'ORCL', 'PEP', 'PFE', 'PG', 'PM',
    'QCOM', 'RTX', 'SBUX', 'SCHW', 'SO', 'SPG', 'T', 'TGT', 'TMO', 'TMUS',
    'TSLA', 'TXN', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VZ', 'WFC', 'WMT',
    'XOM', 'DOW', 'PYPL', 'KHC'
]
len(sp_100)

Data Collection and Data Preprocessing

In [None]:
portfolio_raw_df = yf.download(tickers=sp_100, start="1990-01-01", end="2024-08-01", interval="1mo")
# back-fill missing values; DataFrame.bfill() replaces the deprecated fillna(method="bfill")
portfolio_raw_df = portfolio_raw_df.bfill()
portfolio_raw_df = portfolio_raw_df.stack(level=1).rename_axis(["Date", "Ticker"]).reset_index(level=1)

portfolio_raw_df = portfolio_raw_df.drop("Adj Close", axis=1)
portfolio_raw_df.columns.name = None
portfolio_raw_df = portfolio_raw_df.reset_index()
portfolio_raw_df.Date = portfolio_raw_df.Date.astype(str)
portfolio_raw_df.columns = ["date", "tic", "close", "high", "low", "open", "volume"]

In [None]:
# Check that the data is complete with no NaNs: every ticker's bar should have the same height (same row count)
plt.figure(figsize=(4,2))
plt.bar(np.arange(0, len(sp_100), 1), portfolio_raw_df.groupby("tic").count().mean(1))
plt.show()

In [None]:
portfolio_norm_df = GroupByScaler(by="tic", scaler=MaxAbsScaler).fit_transform(portfolio_raw_df)
df_portfolio = portfolio_norm_df[["date", "tic", "close", "high", "low", "open"]]
df_portfolio_train, df_portfolio_test = train_test_split(df_portfolio, test_size=0.2, shuffle=False, random_state=43)
len(df_portfolio_train), len(df_portfolio_test)
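
Because the frame is in long format (one row per date and ticker), a row-based 80/20 split is not guaranteed to fall on a date boundary. One alternative, sketched here under the assumption that a fixed cutoff date is acceptable, is to split on the `date` column directly. The toy frame and the name `cutoff` are illustrative only; the cutoff value is not a claim about where the split above actually falls.

```python
import pandas as pd

# Toy long-format frame: 4 monthly dates x 2 hypothetical tickers.
dates = pd.date_range("2020-01-01", periods=4, freq="MS").strftime("%Y-%m-%d")
toy_df = pd.DataFrame(
    [(d, t) for d in dates for t in ("AAA", "BBB")], columns=["date", "tic"]
)

cutoff = "2020-03-01"  # illustrative; ISO date strings compare chronologically

train_df = toy_df[toy_df["date"] < cutoff]
test_df = toy_df[toy_df["date"] >= cutoff]

# No date appears on both sides of the split.
overlap = set(train_df["date"]) & set(test_df["date"])
print(len(train_df), len(test_df), overlap)
```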

Setting Hyperparameters

In [None]:
T = 24 # last 2 years of data
num_features = 3

Initializing the Portfolio Optimization Environment

In [None]:
environment = PortfolioOptimizationEnv(
        df_portfolio_train,
        initial_amount=100000,
        comission_fee_pct=0.0025,
        time_window=T,
        features=["close", "high", "low"],
        normalize_df=None
    )

In [None]:
# set PolicyGradient parameters
model_kwargs = {
    "lr": 0.01,
    "policy": EIIE,
}

# here, we can set EIIE's parameters
policy_kwargs = {
    "k_size": 3,
    "time_window": T,
}

model = DRLAgent(environment).get_model("pg", device, model_kwargs, policy_kwargs)

Model Training with 50 episodes

In [None]:
DRLAgent.train_model(model, episodes=50)

In [None]:
torch.save(model.train_policy.state_dict(), "policy_EIIE.pt")

Setting the test environment

In [None]:
environment_test = PortfolioOptimizationEnv(
        df_portfolio_test,
        initial_amount=100000,
        comission_fee_pct=0.0025,
        time_window=T,
        features=["close", "high", "low"],
        normalize_df=None
)

In [None]:
EIIE_results = {
    "training": environment._asset_memory["final"],
    "test": {}
}

# instantiate an architecture with the same arguments used in training
# and load with load_state_dict.
policy = EIIE(k_size=3, 
              time_window=T,
              device=device)
policy.load_state_dict(torch.load("policy_EIIE.pt"))

DRLAgent.DRL_validation(model, environment_test, policy=policy)
EIIE_results["test"]["value"] = environment_test._asset_memory["final"]

In [None]:
out_of_sample_df = pd.DataFrame(index=pd.to_datetime(df_portfolio_test.date.unique()[T-1:]),
                                data=EIIE_results["test"]["value"], columns=["RL"])
out_of_sample_metrics = out_of_sample_df.pct_change().fillna(0)
out_of_sample_df = (out_of_sample_df/out_of_sample_df.iloc[0]) - 1
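
To make the two transformations above concrete, here is a small sketch on made-up portfolio values: `pct_change` gives per-period simple returns, and dividing by the first value and subtracting 1 gives the cumulative return relative to the start. The numbers and the names `values`, `period_returns`, and `cumulative_return` are invented for illustration.

```python
import pandas as pd

# Hypothetical month-start portfolio values.
values = pd.DataFrame(
    {"RL": [100_000.0, 110_000.0, 99_000.0, 120_000.0]},
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]),
)

# Per-period simple returns; the first period has no predecessor, so fill with 0.
period_returns = values.pct_change().fillna(0)

# Cumulative return relative to the first value (0 at the start).
cumulative_return = values / values.iloc[0] - 1

print(period_returns["RL"].round(4).tolist())
print(cumulative_return["RL"].round(4).tolist())
```

Note that the cumulative series starts at 0, so the final/initial value ratio is the last cumulative return plus 1.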

In [None]:
fig, ax = plt.subplots(figsize=(6,3))
ax.plot(out_of_sample_df.index, out_of_sample_df, label="RL")
ax.set_ylabel('Return')
ax.set_xlabel("Date")
ax.legend()
plt.title("Portfolio Cumulative Return")
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()

Preparing Statistics for portfolio performance

In [None]:
import quantstats as qs

In [None]:
rl_sharpe_ratio = qs.stats.sharpe(out_of_sample_metrics, periods=12).item()
rl_mdd = qs.stats.max_drawdown(out_of_sample_metrics).item()
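# out_of_sample_df holds cumulative return (value / initial - 1), so add 1 to get the final/initial value ratio
rl_fapv = out_of_sample_df.iloc[-1].item() + 1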
print(f"RL Sharpe Ratio: {rl_sharpe_ratio}")
print(f"RL Max Drawdown: {rl_mdd}")
print(f"RL final Accumulated Portfolio Value (fAPV): {rl_fapv}x")
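
For reference, the conventional definitions behind these statistics can be written out in a few lines of NumPy. This is a sketch assuming a zero risk-free rate, 12 periods per year, and the sample standard deviation; `quantstats` may differ in details, so small numerical differences from `qs.stats` are possible. The return series is made up.

```python
import numpy as np

# Hypothetical monthly simple returns.
returns = np.array([0.02, -0.01, 0.03, -0.04, 0.05, 0.01])

# Annualized Sharpe ratio: mean over sample std, scaled by sqrt(periods per year).
sharpe = returns.mean() / returns.std(ddof=1) * np.sqrt(12)

# Wealth index starting at 1, then drawdown relative to the running peak.
wealth = np.concatenate(([1.0], np.cumprod(1.0 + returns)))
drawdown = wealth / np.maximum.accumulate(wealth) - 1.0
max_drawdown = drawdown.min()

# Final accumulated portfolio value as a multiple of the initial value.
fapv = wealth[-1]

print(sharpe, max_drawdown, fapv)
```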

Findings

The Reinforcement Learning agent made more money than the benchmark, but with a considerably lower Sharpe ratio. This could be mitigated by better training and by adding more features, such as technical indicators, a longer lookback period, and alternative data such as sentiment. The models also need to be trained multiple times and the results averaged to get a more consistent picture of model performance. Nonetheless, the aim here is to show that a simple RL agent can beat the benchmarks and can be further fine-tuned for better performance.

In our earlier submission, we showed that Deep Learning architectures such as Transformers can learn better. In this example, the agent's policy network is an Ensemble of Identical Independent Evaluators (EIIE), a series of convolutions trained to capture the short-term and long-term trends of these stocks. Next, we will use other policy architectures, such as the Transformer model from our previous submission, and test their ability to assign portfolio weights for optimal allocation.