# Deep Reinforcement Learning on Stock Trading (CS230 Milestone)

## Metadata
Authors: Bicheng Wang, Xinyi Zhang   
Email: bichengw@stanford.edu, xyzh@stanford.edu


## Introduction


Profitable automated trading strategies are critical to investment companies and hedge funds. Because the stock market is dynamic and complex, designing such a strategy is challenging. This project uses a deep reinforcement learning framework to learn a profitable stock trading strategy, with the goal of maximizing cumulative return and alpha.
We select the S\&P 500 Index along with its top 20 stocks by market capitalization as our trading pool.
The input to our algorithm is the market price of these stocks, the remaining balance, the current portfolio, and technical indicator statistics. The agent outputs a series of trading actions across the stocks.
The available trading actions are sell, buy, and hold.

## Dataset

We choose 20 stocks with top performance in the S\&P 500 index (excluding companies that went public after 2000) from 2000 to 2020 as our dataset. The raw data is fetched from the Yahoo Finance API. Each row consists of the date, open price, high price, low price, close price, volume, ticker symbol, and day of the week. The dataset has 105,680 rows in total. We split it into training and test sets on a roughly 90/10 basis: the training set covers 2000 to 2018, and the test set covers 2019 to 2020. The first 5 rows of the dataset are shown below:
![](img/original_dataset_5.png)
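
The chronological split described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the row layout, the placeholder tickers and prices, and the exact boundary date (`SPLIT_DATE`) are assumptions chosen to match the 2000 to 2018 / 2019 to 2020 periods.

```python
from datetime import date

# Assumed boundary: first calendar day of the test period.
SPLIT_DATE = date(2019, 1, 1)


def split_by_date(rows, split_date=SPLIT_DATE):
    """Split rows chronologically so no test-period data leaks into training.

    Each row is a tuple whose first element is the trading date.
    """
    train = [row for row in rows if row[0] < split_date]
    test = [row for row in rows if row[0] >= split_date]
    return train, test


# Placeholder rows: (date, ticker, close). Tickers and prices are made up.
rows = [
    (date(2018, 12, 28), "AAA", 10.0),
    (date(2018, 12, 31), "BBB", 20.0),
    (date(2019, 1, 2), "AAA", 10.5),
    (date(2020, 6, 1), "BBB", 21.0),
]

train_rows, test_rows = split_by_date(rows)
print(len(train_rows), len(test_rows))
```

Splitting on a date rather than on a random sample keeps the evaluation period strictly after the training period, which matters for time-series data.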

## Strategy Description

### Action and Environment Definition

#### State Space

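The state of the multi-stock trading environment is a 337-dimensional vector made up of 17 parts. The balance is a single scalar and each of the other 16 parts holds one value per asset, so with 21 assets the dimension is 1 + 16 × 21 = 337. The parts are: 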
***[b, p, s, macd, boll_ub, boll_lb, rsi_10, rsi_20, cci_10, cci_20, dx_30, close_20_sma, close_60_sma, close_120_sma, close_20_ema, close_60_ema, close_120_ema]***. 
Each component is defined as follows:
- **b**: available balance 
- **p**: close price of each stock
- **s**: shares owned of each stock
- **macd**: Moving Average Convergence Divergence of each stock
- **boll_ub**: Upper Bollinger Bands of each stock
- **boll_lb**: Lower Bollinger Bands of each stock
- **rsi_x**: Relative Strength Index of each stock, calculated using close price
- **cci_x**: Commodity Channel Index of each stock, calculated using high, low and close price
- **dx**: Directional Movement Index of each stock
- **close_x_sma**: Simple Moving Average of each stock
- **close_x_ema**: Exponential Moving Average of each stock
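
As a rough sketch of how the components above could be flattened into one state vector, the snippet below concatenates the balance with one value per asset for every other part. The function name `build_state`, the asset count of 21 (inferred from 337 = 1 + 16 × 21), and the placeholder balance are assumptions for illustration, not the repository's implementation.

```python
# Indicator parts in the order listed in the state definition above.
INDICATOR_NAMES = [
    "macd", "boll_ub", "boll_lb", "rsi_10", "rsi_20", "cci_10", "cci_20",
    "dx_30", "close_20_sma", "close_60_sma", "close_120_sma",
    "close_20_ema", "close_60_ema", "close_120_ema",
]

N_ASSETS = 21  # assumption: inferred from 337 = 1 + 16 * 21


def build_state(balance, prices, shares, indicators):
    """Flatten [b, p, s, indicators...] into a single list of floats.

    `indicators` maps each name in INDICATOR_NAMES to one value per asset.
    """
    if len(prices) != len(shares):
        raise ValueError("prices and shares must have one entry per asset")
    state = [float(balance)]
    state.extend(float(p) for p in prices)
    state.extend(float(s) for s in shares)
    for name in INDICATOR_NAMES:
        values = indicators[name]
        if len(values) != len(prices):
            raise ValueError(f"{name} must have one entry per asset")
        state.extend(float(v) for v in values)
    return state


# Placeholder inputs: all zeros except a hypothetical starting balance.
zeros = [0.0] * N_ASSETS
state = build_state(1_000_000.0, zeros, zeros,
                    {name: zeros for name in INDICATOR_NAMES})
print(len(state))  # 337
```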

#### Action Space

For a single stock, the action space is defined as *{-k, ..., -1, 0, 1, ..., k}*, where *k* and *-k* represent the number of shares to buy and sell, respectively. A predefined parameter sets the maximum number of shares for each buy or sell action. Across all assets, the action is therefore a 21-dimensional vector. It is normalized to [-1, 1] because the RL algorithms A2C and PPO define the policy directly on a Gaussian distribution, which needs to be normalized and symmetric.
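
One plausible way to map a normalized action back to a share count is to clip it to [-1, 1] and scale it by the maximum trade size. The sketch below assumes this mapping; the name `to_share_counts` and the value `H_MAX = 100` are hypothetical and not taken from the project code.

```python
H_MAX = 100  # hypothetical maximum number of shares per buy/sell action


def to_share_counts(normalized_actions, h_max=H_MAX):
    """Convert actions in [-1, 1] to signed integer share counts.

    Positive values mean buy, negative values mean sell, zero means hold.
    """
    shares = []
    for action in normalized_actions:
        clipped = max(-1.0, min(1.0, action))  # guard against out-of-range samples
        shares.append(int(clipped * h_max))    # int() truncates toward zero
    return shares


print(to_share_counts([0.5, -0.337, 1.7, 0.0]))  # [50, -33, 100, 0]
```

Clipping is needed because a Gaussian policy can sample values outside the nominal range.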

#### Reward Function

The reward function is defined as the change in portfolio value when action ***a*** is taken at state ***s*** and the environment arrives at a new state ***s'***. The goal is to design a trading strategy that maximizes this change in portfolio value.
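
A minimal sketch of this reward, assuming portfolio value is the cash balance plus the market value of all holdings and ignoring transaction costs (the text does not say whether costs are modeled). The function names and the numbers in the example are illustrative only.

```python
def portfolio_value(balance, prices, shares):
    """Cash balance plus the market value of every holding."""
    return balance + sum(p * s for p, s in zip(prices, shares))


def reward(prev_state, next_state):
    """Change in portfolio value between two (balance, prices, shares) tuples."""
    return portfolio_value(*next_state) - portfolio_value(*prev_state)


# Illustrative step: buy 10 shares of the first stock at 10.0, then prices move.
before = (1000.0, [10.0, 20.0], [5, 2])   # value: 1000 + 50 + 40 = 1090
after = (900.0, [11.0, 19.0], [15, 2])    # value: 900 + 165 + 38 = 1103
print(reward(before, after))  # 13.0
```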

### Training Strategies

For model training, we have investigated the actor-critic approach and applied two deep reinforcement learning models: PPO and A2C. We are also considering critic-based approaches such as DQN and actor-based approaches such as the policy gradient method.

#### Proximal Policy Optimization (PPO)

**Proximal Policy Optimization** (PPO) [[1]](#1) is introduced to control the policy gradient update and ensure that the new policy will not be too different from the previous one.

According to **Proximal Policy Optimization Algorithms** [[1]](#1), the algorithm limits the size of each policy update and, like other policy gradient methods, searches for a local maximum of the objective within that region. Whereas TRPO enforces a hard KL-divergence constraint, PPO either adds the KL divergence to the objective as a penalty term or clips the probability ratio between the new and old policies, which keeps learning stable. **PPO** is an on-policy method, but its importance-sampling ratio lets it reuse each batch of collected trajectories for several gradient epochs, which is an advantage given the limited amount of historical data in this project.
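
Besides a KL penalty, the PPO paper [[1]](#1) describes a clipped surrogate objective, which is the variant many implementations use. The sketch below computes that objective for a small batch in plain Python. The function name and the default `clip_eps=0.2` are assumptions for illustration; the text does not state which variant or hyperparameters the project uses.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Each term is min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    if not advantages:
        raise ValueError("batch must not be empty")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# The new policy doubles the action's probability; the gain is capped at 1 + eps.
value = ppo_clip_objective([math.log(0.5)], [math.log(0.25)], [1.0])
print(round(value, 6))  # 1.2
```

With a negative advantage the unclipped term is the smaller one, so large harmful policy changes are still penalized in full.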

#### Advantage Actor-Critic (A2C)

The **PPO** method above [[1]](#1) optimizes the policy directly. However, it is also useful to leverage value-based methods, which estimate the expected return, since historical trading data supports such estimates well. The **Advantage Actor-Critic (A2C)** [[2]](#2) method is therefore another valuable option to investigate.

**Advantage Actor-Critic (A2C)** [[2]](#2) is a typical actor-critic algorithm. **A2C** uses copies of the same agent working in parallel to compute gradients from different data samples. Each agent interacts with the same environment independently.
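
The "advantage" in A2C measures how much better an observed return was than the critic's value estimate. A minimal sketch of one common estimator (discounted return bootstrapped from the critic, minus the critic's value) is shown below. The function names and the discount factor are illustrative assumptions, not the project's settings.

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted return for each step, bootstrapped from the critic's
    value estimate of the state that follows the last reward."""
    returns = []
    running = bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage estimate: discounted return minus the critic's value."""
    returns = discounted_returns(rewards, bootstrap_value, gamma)
    return [g - v for g, v in zip(returns, values)]


# Two-step rollout with gamma = 0.5 so the arithmetic is easy to check:
# returns = [1 + 0.5 * 4, 2 + 0.5 * 4] = [3.0, 4.0]
print(advantages([1.0, 2.0], [1.0, 1.0], bootstrap_value=4.0, gamma=0.5))  # [2.0, 3.0]
```

The actor is updated in the direction of actions with positive advantage, while the critic is trained to reduce the gap between its values and the returns.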

## Trading Code

 [RL Training Notebook](https://github.com/BichengWang/RL_stock_trading/blob/master/rl_portfolio_trading.ipynb) 

[Source Code Repo](https://github.com/BichengWang/RL_stock_trading) 

## Results

| Method  | Training Period | Evaluation Period |
| ------------- | ------------- | ------------- |
| A2C | ![](img/training_a2c_backtest1.jpg) | ![](img/evaluation_a2c_backtest1.jpg) |
| PPO | ![](img/training_ppo_backtest1.jpg) | ![](img/evaluation_ppo_backtest1.jpg) |

| Method  | Training Period | Evaluation Period |
| ------------- | ------------- | ------------- |
| A2C | ![](img/training_a2c_backtest2.jpg) | ![](img/evaluation_a2c_backtest2.jpg) |
| PPO | ![](img/training_ppo_backtest2.jpg) | ![](img/evaluation_ppo_backtest2.jpg) |

## Future Plan

We plan to improve the current implementation by:
- Adding more technical indicators
- Trying more RL models such as SAC and TD3
- Changing how the dataset is used: currently, the statistics of different stocks on the same trading day are merged into a single data point; we plan to treat the statistics of a single stock on a trading day as a single data point.

## References
<a id="1">[1]</a> 
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. 
Proximal Policy Optimization Algorithms.
arXiv:1707.06347
https://arxiv.org/abs/1707.06347   
<a id="2">[2]</a> 
Vijay Konda, John Tsitsiklis. 
Actor-critic algorithms.
Society for Industrial and Applied Mathematics.
https://papers.nips.cc/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf   
<a id="3">[3]</a> 
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu.
Asynchronous Methods for Deep Reinforcement Learning.
arXiv:1602.01783
https://arxiv.org/abs/1602.01783

