# Reinforcement Learning Trading Bot

In this notebook, I build trading bots for Pfizer (PFE) stock using reinforcement learning. I add several financial indicators to the dataset so that the bots can take them into account. The bots differ in the parameter "window_size", the number of time steps of reference data each bot sees when choosing an action. I evaluate each bot's performance and the effect of "window_size" on profit.

### Preliminary code

In [80]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gym
import gym_anytrading
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C
from finta import TA
from gym_anytrading.envs import StocksEnv
from sklearn.preprocessing import MinMaxScaler
import random
random.seed(2021)

In [81]:
#Reading in the data
PFE_data = pd.read_csv('C:/Users/chinm/Downloads/PFE_data.csv')

#Converting the date column to datetime and setting this to the index
PFE_data['Date'] = pd.to_datetime(PFE_data['Date'])
PFE_data = PFE_data.set_index('Date')

#Viewing the data
PFE_data.head()

Date,Open,High,Low,Close,Adj Close,Volume
2017-01-03,31.024668,31.309298,30.920303,31.309298,25.928759,23391844
2017-01-04,31.432638,31.641365,31.337761,31.58444,26.156618,22753963
2017-01-05,31.660341,31.963947,31.423149,31.888046,26.408049,21083584
2017-01-06,31.935484,31.973434,31.63188,31.764706,26.305901,18418228
2017-01-09,31.717268,31.944971,31.669828,31.755219,26.298042,21559886


### Adding important indicators to the data

I will use the "finta" package imported earlier to add three trading indicators to the dataset:

   1) The simple moving average (SMA), which smooths the price series and damps random price fluctuations. I compute a separate SMA for each window size, with the averaging period set equal to that window size so that the indicator covers the same horizon as the bot's reference data.

   2) The relative strength index (RSI), which measures momentum.

   3) The on-balance volume (OBV), a cumulative measure of volume that captures buying and selling pressure in the market.
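
To make two of these indicators concrete, here is a minimal sketch of the textbook SMA and OBV definitions in plain Python. The function names are hypothetical and this is not finta's code; finta's implementation may differ in details such as how the first rows are handled. The sample closes and volumes are the first five rows of the table above.

```python
def simple_moving_average(values, period):
    """Mean of the last `period` values; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)  # warm-up rows: fewer than `period` values so far
        else:
            window = values[i + 1 - period : i + 1]
            out.append(sum(window) / period)
    return out


def on_balance_volume(closes, volumes):
    """Running total: add the day's volume on an up close, subtract it on a down close."""
    obv = [0]  # assumption: the running total starts at zero
    for i in range(1, len(closes)):
        if closes[i] > closes[i - 1]:
            obv.append(obv[-1] + volumes[i])
        elif closes[i] < closes[i - 1]:
            obv.append(obv[-1] - volumes[i])
        else:
            obv.append(obv[-1])
    return obv


closes = [31.309298, 31.58444, 31.888046, 31.764706, 31.755219]
volumes = [23391844, 22753963, 21083584, 18418228, 21559886]

print(simple_moving_average(closes, 5))  # only the fifth entry is defined
print(on_balance_volume(closes, volumes))
```

The warm-up rows are why a 5-day SMA has no value for the first four dates and a 20-day SMA has none for the first nineteen.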

In [82]:
#Calculating the quantities
SMA_5 = TA.SMA(PFE_data,5)
SMA_10 = TA.SMA(PFE_data,10)
SMA_15 = TA.SMA(PFE_data,15)
SMA_20 = TA.SMA(PFE_data,20)
RSI = TA.RSI(PFE_data)
OBV = TA.OBV(PFE_data)

In [83]:
#Adding these quantities as columns to the dataset
PFE_data['5-SMA'] = SMA_5
PFE_data['10-SMA'] = SMA_10
PFE_data['15-SMA'] = SMA_15
PFE_data['20-SMA'] = SMA_20
#PFE_data['MACD'] = MACD['MACD']
PFE_data['RSI'] = RSI
PFE_data['OBV'] = OBV

#Replacing null entries with 0
PFE_data = PFE_data.replace(np.nan,0)

In [84]:
#Scaling every column to the [0, 1] range, keeping the column names and the date index
PFE_data = pd.DataFrame(MinMaxScaler().fit_transform(PFE_data),
                        columns=PFE_data.columns,
                        index=PFE_data.index)
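
For reference, with its default settings MinMaxScaler maps each column independently to the range [0, 1] using x' = (x - min) / (max - min). A minimal sketch of that formula for a single column in plain Python; `min_max_scale` is a hypothetical helper, not scikit-learn's implementation.

```python
def min_max_scale(column):
    """Rescale one column so its minimum maps to 0.0 and its maximum to 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

Because the missing SMA values were replaced with 0 in the previous cell, 0 becomes the minimum of each SMA column, so the scaled SMA values are effectively the raw values divided by the column maximum.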

In [85]:
#Viewing the data
PFE_data.head()

Date,Open,High,Low,Close,Adj Close,Volume,5-SMA,10-SMA,15-SMA,20-SMA,RSI,OBV
2017-01-03,0.223736,0.202487,0.263423,0.254228,0.088191,0.08092,0.0,0.0,0.0,0.0,0.0,0.2957
2017-01-04,0.248154,0.223209,0.288031,0.270575,0.102015,0.078084,0.0,0.0,0.0,0.0,1.0,0.316081
2017-01-05,0.261783,0.243339,0.293065,0.288613,0.11727,0.070659,0.0,0.0,0.0,0.0,1.0,0.334965
2017-01-06,0.278251,0.243931,0.305369,0.281285,0.111072,0.058811,0.0,0.0,0.0,0.0,0.80803,0.318468
2017-01-09,0.26519,0.242155,0.307606,0.280722,0.110595,0.072776,0.73228,0.0,0.0,0.0,0.795382,0.299157


### Creating the environments, reading in the models, and evaluating their performance

As mentioned earlier, the models built in the code cells below differ in window size.
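
To make the role of the window size concrete, here is a minimal sketch in plain Python of the two index calculations involved. `data_slice` mirrors the start and end indices computed in the `env_signals` functions in the cells that follow. `observation_window` is a hypothetical helper, and its convention (the window ends at the current tick, inclusive) is an assumption, since the exact indexing inside gym_anytrading may differ between versions.

```python
def data_slice(frame_bound, window_size):
    """Row range passed to the environment: the frame plus `window_size` rows of history."""
    start_index = frame_bound[0] - window_size
    end_index = frame_bound[1]
    return start_index, end_index


def observation_window(signal_rows, current_tick, window_size):
    """Return the `window_size` rows ending at `current_tick` (inclusive)."""
    start = current_tick - window_size + 1
    if start < 0:
        raise ValueError("not enough history for this window size")
    return signal_rows[start : current_tick + 1]


# Training frame with window size 5, and test frame with window size 20.
print(data_slice((5, 50), 5))       # (0, 50)
print(data_slice((100, 150), 20))   # (80, 150)

rows = list(range(10))              # stand-in for ten rows of signal features
print(observation_window(rows, current_tick=6, window_size=5))  # [2, 3, 4, 5, 6]
```

A larger window therefore gives the bot more past rows per decision, and it also requires the frame to start at least that many rows into the data.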

##### Window size = 5

In [86]:
#Defining the window size
wind_size = 5

#Defining a function that will yield the signals of the environment
def env_signals(env):
    start_index = env.frame_bound[0]-env.window_size
    end_index = env.frame_bound[1]
    prices = env.df.iloc[start_index:end_index]['Low'].to_numpy()
    signal_features = env.df.iloc[start_index:end_index][['Low','Volume','5-SMA','RSI','OBV']].to_numpy()
    return prices, signal_features

#Creating an environment class
class environment(StocksEnv):
    _process_data = env_signals

#Building the environment
env = environment(df=PFE_data,window_size=wind_size,frame_bound=(wind_size,50))
make_env = lambda: env
env = DummyVecEnv([make_env])

#Training the model
model_5 = A2C('MlpLstmPolicy', env, verbose=1) 
model_5.learn(total_timesteps=40000)

---------------------------------
| explained_variance | -1.71    |
| fps                | 11       |
| nupdates           | 1        |
| policy_entropy     | 0.693    |
| total_timesteps    | 5        |
| value_loss         | 0.000895 |
---------------------------------
---------------------------------
| explained_variance | -1.54    |
| fps                | 167      |
| nupdates           | 100      |
| policy_entropy     | 0.693    |
| total_timesteps    | 500      |
| value_loss         | 0.00013  |
---------------------------------
---------------------------------
| explained_variance | 0.433    |
| fps                | 177      |
| nupdates           | 200      |
| policy_entropy     | 0.693    |
| total_timesteps    | 1000     |
| value_loss         | 0.00129  |
---------------------------------
---------------------------------
| explained_variance | -1.16    |
| fps                | 181      |
| nupdates           | 300      |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.441    |
| fps                | 189      |
| nupdates           | 3100     |
| policy_entropy     | 0.693    |
| total_timesteps    | 15500    |
| value_loss         | 8.24e-05 |
---------------------------------
---------------------------------
| explained_variance | -2.22    |
| fps                | 189      |
| nupdates           | 3200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16000    |
| value_loss         | 0.000202 |
---------------------------------
---------------------------------
| explained_variance | -0.731   |
| fps                | 189      |
| nupdates           | 3300     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16500    |
| value_loss         | 0.000503 |
---------------------------------
---------------------------------
| explained_variance | -0.158   |
| fps                | 189      |
| nupdates           | 3400     |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.623    |
| fps                | 89       |
| nupdates           | 6200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 31000    |
| value_loss         | 6.57e-05 |
---------------------------------
---------------------------------
| explained_variance | -68.2    |
| fps                | 90       |
| nupdates           | 6300     |
| policy_entropy     | 0.691    |
| total_timesteps    | 31500    |
| value_loss         | 0.000232 |
---------------------------------
---------------------------------
| explained_variance | 0.74     |
| fps                | 92       |
| nupdates           | 6400     |
| policy_entropy     | 0.692    |
| total_timesteps    | 32000    |
| value_loss         | 2.36e-05 |
---------------------------------
---------------------------------
| explained_variance | -30.5    |
| fps                | 93       |
| nupdates           | 6500     |
| policy_entropy     | 0.693    |
[... output truncated ...]

<stable_baselines.a2c.a2c.A2C at 0x15e9b9bb748>

In [87]:
#Testing the model and viewing its profit
env = environment(df=PFE_data, window_size=wind_size, frame_bound=(100,150))
env.seed(2021)
obs = env.reset()
while True: 
    obs = obs[np.newaxis, ...]
    action, _states = model_5.predict(obs)
    obs, rewards, done, info = env.step(action)
    if done:
        print("The total profit after 50 days is ",info['total_profit'])
        break

The total profit after 50 days is  1.0463290664053022
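
The cells for the other window sizes repeat the ones above, changing only `wind_size` and the SMA column. Here is a minimal sketch of how that repetition could be factored out. `signal_columns` and `make_env_signals` are hypothetical helper names, not part of gym_anytrading, and the sketch has not been run against the environment.

```python
def signal_columns(window_size):
    """Feature columns for one bot: only the SMA column depends on the window size."""
    return ['Low', 'Volume', f'{window_size}-SMA', 'RSI', 'OBV']


def make_env_signals(window_size):
    """Build the function that would be assigned to `_process_data` for one window size."""
    columns = signal_columns(window_size)

    def env_signals(env):
        # Same slicing as the notebook cells: the frame plus window_size rows of history.
        start_index = env.frame_bound[0] - env.window_size
        end_index = env.frame_bound[1]
        rows = env.df.iloc[start_index:end_index]
        return rows['Low'].to_numpy(), rows[columns].to_numpy()

    return env_signals


for window_size in (5, 10, 15, 20):
    print(window_size, signal_columns(window_size))
```

Under this sketch, each environment class would set `_process_data = make_env_signals(window_size)` instead of redefining `env_signals` by hand.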


##### Window size = 10

In [88]:
#Defining the window size
wind_size = 10

#Defining a function that will yield the signals of the environment
def env_signals(env):
    start_index = env.frame_bound[0]-env.window_size
    end_index = env.frame_bound[1]
    prices = env.df.iloc[start_index:end_index]['Low'].to_numpy()
    signal_features = env.df.iloc[start_index:end_index][['Low','Volume','10-SMA','RSI','OBV']].to_numpy()
    return prices, signal_features

#Creating an environment class
class environment(StocksEnv):
    _process_data = env_signals

#Building the environment
env = environment(df=PFE_data,window_size=wind_size,frame_bound=(wind_size,50))
make_env = lambda: env
env = DummyVecEnv([make_env])

#Training the model
model_10 = A2C('MlpLstmPolicy', env, verbose=1) 
model_10.learn(total_timesteps=40000)

---------------------------------
| explained_variance | 0.242    |
| fps                | 27       |
| nupdates           | 1        |
| policy_entropy     | 0.693    |
| total_timesteps    | 5        |
| value_loss         | 0.000111 |
---------------------------------
---------------------------------
| explained_variance | 0.0318   |
| fps                | 367      |
| nupdates           | 100      |
| policy_entropy     | 0.692    |
| total_timesteps    | 500      |
| value_loss         | 0.000128 |
---------------------------------
---------------------------------
| explained_variance | 0.479    |
| fps                | 402      |
| nupdates           | 200      |
| policy_entropy     | 0.692    |
| total_timesteps    | 1000     |
| value_loss         | 6.97e-05 |
---------------------------------
---------------------------------
| explained_variance | 0.622    |
| fps                | 409      |
| nupdates           | 300      |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | -0.183   |
| fps                | 399      |
| nupdates           | 3100     |
| policy_entropy     | 0.69     |
| total_timesteps    | 15500    |
| value_loss         | 2.48e-05 |
---------------------------------
---------------------------------
| explained_variance | -27.8    |
| fps                | 398      |
| nupdates           | 3200     |
| policy_entropy     | 0.692    |
| total_timesteps    | 16000    |
| value_loss         | 0.000561 |
---------------------------------
---------------------------------
| explained_variance | 0.853    |
| fps                | 396      |
| nupdates           | 3300     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16500    |
| value_loss         | 0.000254 |
---------------------------------
---------------------------------
| explained_variance | -0.111   |
| fps                | 396      |
| nupdates           | 3400     |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.319    |
| fps                | 363      |
| nupdates           | 6200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 31000    |
| value_loss         | 0.000132 |
---------------------------------
---------------------------------
| explained_variance | -0.718   |
| fps                | 360      |
| nupdates           | 6300     |
| policy_entropy     | 0.693    |
| total_timesteps    | 31500    |
| value_loss         | 9.05e-05 |
---------------------------------
---------------------------------
| explained_variance | 0.762    |
| fps                | 359      |
| nupdates           | 6400     |
| policy_entropy     | 0.692    |
| total_timesteps    | 32000    |
| value_loss         | 4.06e-05 |
---------------------------------
---------------------------------
| explained_variance | -10.2    |
| fps                | 357      |
| nupdates           | 6500     |
| policy_entropy     | 0.692    |
[... output truncated ...]

<stable_baselines.a2c.a2c.A2C at 0x15e9e760d88>

In [89]:
#Testing the model and viewing its profit
env = environment(df=PFE_data, window_size=wind_size, frame_bound=(100,150))
env.seed(2021)
obs = env.reset()
while True: 
    obs = obs[np.newaxis, ...]
    action, _states = model_10.predict(obs)
    obs, rewards, done, info = env.step(action)
    if done:
        print("The total profit after 50 days is ",info['total_profit'])
        break

The total profit after 50 days is  0.8370495085479697


##### Window size = 15

In [90]:
#Defining the window size
wind_size = 15

#Defining a function that will yield the signals of the environment
def env_signals(env):
    start_index = env.frame_bound[0]-env.window_size
    end_index = env.frame_bound[1]
    prices = env.df.iloc[start_index:end_index]['Low'].to_numpy()
    signal_features = env.df.iloc[start_index:end_index][['Low','Volume','15-SMA','RSI','OBV']].to_numpy()
    return prices, signal_features

#Creating an environment class
class environment(StocksEnv):
    _process_data = env_signals

#Building the environment
env = environment(df=PFE_data,window_size=wind_size,frame_bound=(wind_size,50))
make_env = lambda: env
env = DummyVecEnv([make_env])

#Training the model
model_15 = A2C('MlpLstmPolicy', env, verbose=1) 
model_15.learn(total_timesteps=40000)

---------------------------------
| explained_variance | -41.2    |
| fps                | 25       |
| nupdates           | 1        |
| policy_entropy     | 0.693    |
| total_timesteps    | 5        |
| value_loss         | 0.000366 |
---------------------------------
---------------------------------
| explained_variance | 0.406    |
| fps                | 351      |
| nupdates           | 100      |
| policy_entropy     | 0.693    |
| total_timesteps    | 500      |
| value_loss         | 6.19e-05 |
---------------------------------
---------------------------------
| explained_variance | -2.01    |
| fps                | 380      |
| nupdates           | 200      |
| policy_entropy     | 0.693    |
| total_timesteps    | 1000     |
| value_loss         | 0.000233 |
---------------------------------
---------------------------------
| explained_variance | -0.462   |
| fps                | 386      |
| nupdates           | 300      |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.356    |
| fps                | 361      |
| nupdates           | 3100     |
| policy_entropy     | 0.693    |
| total_timesteps    | 15500    |
| value_loss         | 6.91e-05 |
---------------------------------
---------------------------------
| explained_variance | 0.782    |
| fps                | 357      |
| nupdates           | 3200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16000    |
| value_loss         | 0.00024  |
---------------------------------
---------------------------------
| explained_variance | 0.213    |
| fps                | 355      |
| nupdates           | 3300     |
| policy_entropy     | 0.692    |
| total_timesteps    | 16500    |
| value_loss         | 0.00121  |
---------------------------------
---------------------------------
| explained_variance | -3.12    |
| fps                | 351      |
| nupdates           | 3400     |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.349    |
| fps                | 303      |
| nupdates           | 6200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 31000    |
| value_loss         | 0.000462 |
---------------------------------
---------------------------------
| explained_variance | -2.3     |
| fps                | 304      |
| nupdates           | 6300     |
| policy_entropy     | 0.692    |
| total_timesteps    | 31500    |
| value_loss         | 0.000276 |
---------------------------------
---------------------------------
| explained_variance | -6.54    |
| fps                | 305      |
| nupdates           | 6400     |
| policy_entropy     | 0.689    |
| total_timesteps    | 32000    |
| value_loss         | 3.84e-05 |
---------------------------------
---------------------------------
| explained_variance | -12.7    |
| fps                | 305      |
| nupdates           | 6500     |
| policy_entropy     | 0.693    |
[... output truncated ...]

<stable_baselines.a2c.a2c.A2C at 0x15e9fe71fc8>

In [91]:
#Testing the model and viewing its profit
env = environment(df=PFE_data, window_size=wind_size, frame_bound=(100,150))
env.seed(2021)
obs = env.reset()
while True: 
    obs = obs[np.newaxis, ...]
    action, _states = model_15.predict(obs)
    obs, rewards, done, info = env.step(action)
    if done:
        print("The total profit after 50 days is ",info['total_profit'])
        break

The total profit after 50 days is  0.9776773625005819


##### Window size = 20

In [94]:
#Defining the window size
wind_size = 20

#Defining a function that will yield the signals of the environment
def env_signals(env):
    start_index = env.frame_bound[0]-env.window_size
    end_index = env.frame_bound[1]
    prices = env.df.iloc[start_index:end_index]['Low'].to_numpy()
    signal_features = env.df.iloc[start_index:end_index][['Low','Volume','20-SMA','RSI','OBV']].to_numpy()
    return prices, signal_features

#Creating an environment class
class environment(StocksEnv):
    _process_data = env_signals

#Building the environment
env = environment(df=PFE_data,window_size=wind_size,frame_bound=(wind_size,50))
make_env = lambda: env
env = DummyVecEnv([make_env])

#Training the model
model_20 = A2C('MlpLstmPolicy', env, verbose=1) 
model_20.learn(total_timesteps=40000)

---------------------------------
| explained_variance | 0.597    |
| fps                | 28       |
| nupdates           | 1        |
| policy_entropy     | 0.693    |
| total_timesteps    | 5        |
| value_loss         | 0.000887 |
---------------------------------
---------------------------------
| explained_variance | -1.03    |
| fps                | 349      |
| nupdates           | 100      |
| policy_entropy     | 0.693    |
| total_timesteps    | 500      |
| value_loss         | 0.00122  |
---------------------------------
---------------------------------
| explained_variance | -0.231   |
| fps                | 374      |
| nupdates           | 200      |
| policy_entropy     | 0.692    |
| total_timesteps    | 1000     |
| value_loss         | 0.00101  |
---------------------------------
---------------------------------
| explained_variance | -4.75    |
| fps                | 377      |
| nupdates           | 300      |
| policy_entropy     | 0.693    |
[... output truncated ...]

---------------------------------
| explained_variance | 0.842    |
| fps                | 397      |
| nupdates           | 3100     |
| policy_entropy     | 0.693    |
| total_timesteps    | 15500    |
| value_loss         | 0.00017  |
---------------------------------
---------------------------------
| explained_variance | -0.698   |
| fps                | 397      |
| nupdates           | 3200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16000    |
| value_loss         | 0.000175 |
---------------------------------
---------------------------------
| explained_variance | 0.153    |
| fps                | 398      |
| nupdates           | 3300     |
| policy_entropy     | 0.693    |
| total_timesteps    | 16500    |
| value_loss         | 5.25e-05 |
---------------------------------
---------------------------------
| explained_variance | -331     |
| fps                | 397      |
| nupdates           | 3400     |
| policy_entropy     | 0.692    |
[... output truncated ...]

---------------------------------
| explained_variance | -1.06    |
| fps                | 400      |
| nupdates           | 6200     |
| policy_entropy     | 0.693    |
| total_timesteps    | 31000    |
| value_loss         | 8.17e-05 |
---------------------------------
---------------------------------
| explained_variance | -260     |
| fps                | 400      |
| nupdates           | 6300     |
| policy_entropy     | 0.691    |
| total_timesteps    | 31500    |
| value_loss         | 0.000137 |
---------------------------------
---------------------------------
| explained_variance | 0.76     |
| fps                | 400      |
| nupdates           | 6400     |
| policy_entropy     | 0.693    |
| total_timesteps    | 32000    |
| value_loss         | 7.18e-05 |
---------------------------------
---------------------------------
| explained_variance | -10      |
| fps                | 400      |
| nupdates           | 6500     |
| policy_entropy     | 0.693    |
[... output truncated ...]

<stable_baselines.a2c.a2c.A2C at 0x15eaaf95288>

In [107]:
#Testing the model and viewing its profit
env = environment(df=PFE_data, window_size=wind_size, frame_bound=(100,150))
env.seed(2021)
obs = env.reset()
while True: 
    obs = obs[np.newaxis, ...]
    action, _states = model_20.predict(obs)
    obs, rewards, done, info = env.step(action)
    if done:
        print("The total profit after 50 days is ",info['total_profit'])
        break

The total profit after 50 days is  1.3488334580650414


Thus, the model with a window size of 20 days delivers the highest profit, almost 35% (a total-profit multiple of about 1.35). The models with window sizes of 10 and 15 finished below 1.0, that is, at a loss, so profit does not rise steadily with window size in these runs. The result for 20 days may be because this model has the largest window size, meaning it uses the most reference data to make its trading decisions, but each model was trained and tested only once, so the comparison is tentative. Future work could try larger window sizes and identify the optimal window size for the reference data. Additionally, in future projects I intend to make greater use of the Finta package: I have used only a fraction of the financial indicators it can calculate, and I am curious to see how different numbers and combinations of financial indicators and environment signals affect the trading bot.
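
The comparison can be checked with a short calculation. This sketch assumes that `total_profit` is a multiple of the starting capital (1.0 means break-even), which is consistent with reading 1.3488 as a gain of almost 35%. The values are the four results printed above; `percent_return` is a hypothetical helper name.

```python
# total_profit reported by each test cell, keyed by window size
reported = {
    5: 1.0463290664053022,
    10: 0.8370495085479697,
    15: 0.9776773625005819,
    20: 1.3488334580650414,
}


def percent_return(total_profit):
    """Convert a profit multiple (1.0 = break-even) into a percentage gain or loss."""
    return (total_profit - 1.0) * 100.0


for window_size, profit in sorted(reported.items()):
    print(f"window_size={window_size:>2}: {percent_return(profit):+.2f}%")

best = max(reported, key=reported.get)
print("Best window size:", best)
```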

### References

1. https://github.com/nicknochnack/Reinforcement-Learning-for-Trading