# Training deep RL agents to achieve 1.6% net returns daily

**While information here is presented in good faith, it does not constitute financial advice or replace a qualified financial advisor. This notebook is intended for educational purposes only. Use at your own risk.**

If you're just looking for directions to deploy your own automated trading bot, please skip to the final section in this tutorial.

Automated stock trading is a real-world problem that accentuates the challenges and limitations of supervised learning, the simulation-reality gap, continual learning, and hyperparameter tuning. Unlike the real market -- which is highly chaotic -- datasets do not respond to trading behavior. Reinforcement learning simulators can likewise only provide samples of synthetic market behavior, and even this does not accurately reflect the behavior of human investors. The automated trading problem is also nonstationary -- the market can change at any time, and agents adapted to previous trends might not be able to operate competitively under the new market conditions. Even solutions powered by continual learning may eventually degenerate and require manual intervention, re-tuning, or architecture re-design. Developing profitable trading agents is therefore an excellent way to jump into the bleeding edge of machine learning.

This tutorial walks through training and deploying a high-frequency (minute-level) deep reinforcement learning stock trading agent using TensorFlow and Alpaca. We'll start with an overview of the general problem. Then we'll look at the Alpaca API and how it can be used to trade stocks. Next, we'll propose a few candidate agent and training architectures to experiment with. Then, we will build a training pipeline and run it on historical data. After analysis and hyperparameter tuning, we'll test-deploy our agents on live paper-trading markets. Finally, we'll set up daily email notifications, schedule automated GCP deployments during market hours, and let our agents loose in the wild (the IEX exchange).

## Problem Setup

Stock trading aims to make a profit by buying low and selling high. Formally, given a sequence of average stock prices (vwap) $p \in \mathbb{R}^{T \times N_s}$, portfolio holdings $h \in \mathbb{Z}^{T \times N_s}$, cash $c \in \mathbb{R}^T$, and other per-stock and market-level trading signals $s_{indv} \in \mathbb{R}^{T \times (N_s N_{indv\ sig})}$ and $s_{mkt} \in \mathbb{R}^{T \times N_{mkt\ sig}}$ respectively, the trading agent must execute trading decisions (bid/hold/offer choice and count $d \in \mathbb{Z}^{T \times N_s}$, max bid price $p^{bid} \in \mathbb{R}^{T \times N_s}$, and min offer price $p^{offer} \in \mathbb{R}^{T \times N_s}$) at each time step $t \in [0,T]$ for $N_s$ stocks such that net worth (reward) $r_t = c_t + \sum_{i \in \text{stocks}} h^i_t p^i_t$ is maximized.

Observations and actions are concatenated along the stock dimension for each timestep, with market-level variables structured in a separate flat tensor as follows: $o = ( o_{mkt}, o_{indv} )$, $o_{indv} = [p; h; s_{indv}; d; p^{bid}; p^{offer}] \in \mathbb{Z/R}^{T \times N_s \times \cdots}$, $o_{mkt} = [c; s_{mkt}; r]  \in \mathbb{Z/R}^{T \times \cdots}$, and $a = [d, p^{bid}, p^{offer}] \in \mathbb{Z/R}^{T \times N_s \times 3}$.
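To make the net-worth reward $r_t$ concrete, here is a minimal NumPy sketch. The function name `net_worth` and the toy values are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def net_worth(cash, holdings, prices):
    """Return r_t = c_t + sum_i h_t^i * p_t^i for every timestep.

    cash:     shape (T,)      cash balance c
    holdings: shape (T, N_s)  integer share counts h
    prices:   shape (T, N_s)  average prices p (e.g. vwap)
    """
    return cash + np.sum(holdings * prices, axis=1)

# Toy data (illustrative values only): T = 3 timesteps, N_s = 2 stocks.
cash = np.array([100.0, 50.0, 70.0])
holdings = np.array([[0, 0], [5, 0], [3, 1]])
prices = np.array([[10.0, 20.0], [10.0, 21.0], [11.0, 19.0]])

r = net_worth(cash, holdings, prices)  # one net-worth value per timestep
```

Buying five shares at a price of 10 between the first two timesteps moves value from cash into holdings without changing net worth; only price movements (and transaction outcomes) change $r_t$.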

The bid/hold/offer choice and count $d^i_t$ is interpreted in three cases:
- $d^i_t = 0$ results in stock $i$ being held at time $t$,
- $d^i_t < 0$ results in $\min \{ |d^i_t| , h^i_t \}$ shares of stock $i$ being offered for sale at a price of $p^{offer,i}_t$, and
- $d^i_t > 0$ results in $d^i_t$ shares of stock $i$ (or as many as can be afforded) being bid on for purchase at a price of $p^{bid,i}_t$.

Since the agent places bids and offers rather than guaranteed trades, its decisions do not necessarily result in a transaction; the outcome depends on market conditions. The agent's own position also constrains what it can do. For example, if the agent has $h^i_t = 0$ and $d^i_t = -1$ (i.e., it wants to sell one share of stock $i$), no shares will actually be offered for sale.
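The three cases can be sketched as a small pure-Python helper. This is only an illustration: `interpret_decision` and its return format are hypothetical, and the sketch assumes that sales are capped at current holdings (consistent with the zero-holdings example) and that purchases are capped at what the cash balance can cover at the bid price.

```python
def interpret_decision(d, holdings, cash, bid_price, offer_price):
    """Map a signed decision d for one stock to (side, quantity, limit_price).

    Hypothetical helper: the tuple format is an assumption for illustration.
    """
    if d == 0:
        return ("hold", 0, None)
    if d < 0:
        # Cannot offer more shares than are currently held.
        qty = min(abs(d), holdings)
        return ("offer", qty, offer_price) if qty > 0 else ("hold", 0, None)
    # d > 0: bid on d shares, or as many as can be afforded at the bid price.
    affordable = int(cash // bid_price) if bid_price > 0 else 0
    qty = min(d, affordable)
    return ("bid", qty, bid_price) if qty > 0 else ("hold", 0, None)

# Wanting to sell one share while holding none results in no order.
no_order = interpret_decision(-1, holdings=0, cash=100.0, bid_price=10.0, offer_price=11.0)
# Wanting to sell five shares while holding three offers only three.
capped_sale = interpret_decision(-5, holdings=3, cash=0.0, bid_price=10.0, offer_price=11.0)
# Wanting to buy four shares with cash for two bids on two.
capped_buy = interpret_decision(4, holdings=0, cash=25.0, bid_price=10.0, offer_price=11.0)
```

Even a non-empty order is only a bid or an offer; whether it fills still depends on the market.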

Additional market signals used to drive the agent's behavior are:
- Share price volatility
- Market volatility
- Sharpe ratio
- Market Sharpe ratio
- Share moving average
- Market moving average
- Forecasted share price
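
As a rough illustration of how the per-share versions of these signals might be computed, here is a NumPy sketch of a rolling moving average, volatility, and Sharpe ratio. The function name, the window length, the use of simple returns, a zero risk-free rate, and the lack of annualization are all assumptions; the notebook does not specify them. The forecasted share price is not covered here.

```python
import numpy as np

def rolling_signals(prices, window):
    """Rolling moving average, volatility, and Sharpe ratio for one price series.

    Assumptions (not from the notebook): simple returns, zero risk-free
    rate, no annualization. Each output has length len(prices) - window.
    """
    prices = np.asarray(prices, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0   # simple per-step returns, length T - 1
    n = len(returns) - window + 1              # number of full windows
    if n < 1:
        raise ValueError("price series is too short for this window")
    moving_avg = np.empty(n)
    volatility = np.empty(n)
    sharpe = np.zeros(n)                       # left at 0 where volatility is 0
    for k in range(n):
        win = returns[k:k + window]
        moving_avg[k] = prices[k + 1:k + 1 + window].mean()
        volatility[k] = win.std()
        if volatility[k] > 0:
            sharpe[k] = win.mean() / volatility[k]
    return moving_avg, volatility, sharpe

ma, vol, sr = rolling_signals([100.0, 101.0, 102.0, 103.0, 104.0], window=2)
```

The market-level versions could plausibly be obtained by applying the same function to an index or to a cross-stock average price, though that choice is also an assumption.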

TODO. I should make the dataset and environment here so we can see exactly what an observation or action looks like. Actually, I should split the above cell and put this one in between the two.

## Training Loop

Succinctly, we can now describe the primary optimization objective as $\pi^{*} = \arg\max_\pi \sum_t \lambda^t r_t$, where $\lambda \in [0,1]$ is the reward discount factor. Values of $\lambda$ near 0 result in immediate payoff being maximized, while values near 1 result in long-term payoff being maximized.
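
The effect of the discount can be seen in a few lines of plain Python. This is a minimal sketch of the discounted sum inside the objective; the function name and the toy reward sequence are illustrative.

```python
def discounted_return(rewards, discount):
    """Compute sum_t discount**t * rewards[t], with t starting at 0."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + discount * total
    return total

rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, 0.0)      # only the immediate reward counts
balanced = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25
farsighted = discounted_return(rewards, 1.0)  # all rewards count equally
```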

Reinforcement learning only provides a single feedback signal at each timestep. Therefore, training is slow and requires a large number of training iterations. Research in multitask learning has shown that, in many cases, augmenting a model's training paradigm with auxiliary objectives results in faster convergence and superior performance -- even when these auxiliary objectives outnumber the final task objective in terms of number of training examples seen. I'm going to assume that training the base layers of the trading agent to perform forward and backward autoregressive modeling of $o$ for each timestep should encode important information about market dynamics into the weights of those layers. Self-supervised autoregressive training epochs will be probabilistically interspersed with portfolio-maximization reinforcement learning episodes, with the ratio of self-supervised to reinforcement learning epochs decreasing as training progresses.
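
One way to realize this kind of interleaving is to sample the epoch type from a probability that decays over training. The sketch below is an assumption-laden illustration: the linear decay, the start and end probabilities (0.9 and 0.1), and both function names are hypothetical choices, not settings taken from this notebook.

```python
import random

def self_supervised_probability(epoch, total_epochs, p_start=0.9, p_end=0.1):
    """Linearly anneal the chance of a self-supervised epoch (assumed schedule)."""
    frac = epoch / max(total_epochs - 1, 1)
    return p_start + (p_end - p_start) * frac

def make_schedule(total_epochs, seed=0):
    """Sample an epoch type for every epoch using the annealed probability."""
    rng = random.Random(seed)  # seeded for reproducibility
    schedule = []
    for epoch in range(total_epochs):
        p = self_supervised_probability(epoch, total_epochs)
        schedule.append("self_supervised" if rng.random() < p else "reinforcement")
    return schedule

schedule = make_schedule(total_epochs=20)
```

Early epochs are then dominated by autoregressive modeling, and later epochs by reinforcement learning episodes. An exponential or step-wise decay would fit the description equally well.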

**Architecture**

sequence transformations
- vanilla
- Fourier features
- multiscale view

base layers
- {attention,max,ave} pooling over all stocks
- {CNN,RNN,LSTM,Linformer} over sequence

multiple agents
- competitive vs. cooperative
- agents can communicate with internal channels
- agents observe each other's actions

decision output head
- vanilla output unit for each stock
- trainable weighted 0th, 1st, and 2nd order integrator
- powers of two to Nmax output units for each stock
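
As an example of one of the options above, the Fourier features transformation can be read as appending sinusoidal encodings of each timestep's position to the sequence. The outline does not say which formulation is intended, so the sketch below is an assumption: it uses power-of-two frequencies over normalized positions, and the function name and arguments are hypothetical.

```python
import numpy as np

def fourier_features(num_steps, num_bands):
    """Sinusoidal position features of shape (num_steps, 2 * num_bands).

    Assumed formulation: sin and cos of 2*pi*f*t for f = 1, 2, 4, ...
    with positions t normalized to [0, 1).
    """
    t = np.arange(num_steps) / num_steps          # normalized positions
    freqs = 2.0 ** np.arange(num_bands)           # 1, 2, 4, ...
    angles = 2.0 * np.pi * t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

features = fourier_features(num_steps=8, num_bands=3)
```

These features would be concatenated to the per-timestep observations before the base layers, giving the model explicit access to position at several time scales.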

## Getting Started

```python
import timeit
import getpass

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.backend as K
import tensorflow.keras.layers as tfkl
import tensorflow.keras.models as tfkm
```


## Alpaca Trading API

## Alpaca Environment

## Reinforcement Learning

## Deploying the Alpaca Trading Bot