Short Backwarding for Selecting the Best Historical Agent in a Consulting System for Portfolio Allocation with Deep Reinforcement Learning

## 4.0 Feature Engineering and Data Preprocessing
---
We perform feature engineering and data preprocessing by:
* Adding technical indicators to the data. The technical indicators are used as inputs when training our reinforcement learning model
* Adding covariance matrices, which are also used as inputs for training the models
* Splitting the data into the training set and the testing (trading) set
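
The split step in the last bullet can be illustrated with a small date-based filter. This is a minimal sketch in plain pandas, not the notebook's actual split: the helper name `split_by_date`, the toy rows, and the cutoff dates are all assumptions chosen for illustration.

```python
import pandas as pd

def split_by_date(df, start, end, date_col="date"):
    """Return rows with start <= date < end, sorted by date and ticker (hypothetical helper)."""
    dates = pd.to_datetime(df[date_col])
    mask = (dates >= pd.Timestamp(start)) & (dates < pd.Timestamp(end))
    out = df.loc[mask].sort_values([date_col, "tic"], ignore_index=True)
    # Index rows by trading day so all tickers of one day share an index value
    out.index = out[date_col].factorize()[0]
    return out

# Toy data: two tickers over three trading days (values are made up)
toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-02", "2020-01-03",
             "2020-01-03", "2020-01-06", "2020-01-06"],
    "tic": ["KO", "PG"] * 3,
    "close": [54.0, 123.0, 54.5, 122.0, 55.0, 124.0],
})

train = split_by_date(toy, "2020-01-01", "2020-01-04")  # illustrative cutoff
trade = split_by_date(toy, "2020-01-04", "2020-01-31")  # illustrative cutoff
```

Using a half-open interval keeps the two sets disjoint, so no trading day can leak from the testing set into training.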

### 4.1 Import Relevant Libraries

In [1]:
import pandas as pd
import numpy as np
import ta
from ta import add_all_ta_features
from ta.utils import dropna

from finrl.preprocessing.data import data_split
from finrl.preprocessing.preprocessors import FeatureEngineer
from pickleshare import PickleShareDB


### 4.2 Load the Data from the csv Files

In [2]:
# Load the whole data set
data = pd.read_csv('./datasets/data.csv')

# Load the close prices dataset
prices_data = pd.read_csv('./datasets/close_prices.csv')



In [3]:
filtered_stocks = pd.read_csv('filtered_stocks.csv')
filtered_stocks = filtered_stocks.drop(columns=['Unnamed: 0'])
filtered_stocks = filtered_stocks['stock_name'].tolist()
%store filtered_stocks

FileNotFoundError: [Errno 2] No such file or directory: 'filtered_stocks.csv'

In [None]:
list_of_stocks = filtered_stocks
print(list_of_stocks)

['JNJ', 'PG', 'WMT', 'PFE', 'KO', 'MMM', 'MCD', 'VZ', 'IBM', 'RTX', 'MRK', 'HD', 'CSCO', 'XOM', 'DIS', 'CVX', 'AXP', 'CAT', 'MSFT', 'NKE']


In [None]:
data.head()

Unnamed: 0,date,tic,close,high,low,open,volume
0,2008-03-19,AAPL,3.915352,4.796071,4.631071,4.754286,1010537000.0
1,2008-03-19,AXP,32.371143,44.48,41.919998,44.200001,14098300.0
2,2008-03-19,BA,54.094543,77.0,73.449997,76.980003,9195600.0
3,2008-03-19,CAT,47.48143,77.0,73.730003,76.620003,7377400.0
4,2008-03-19,CSCO,16.594309,25.58,24.459999,25.469999,63988600.0


In [None]:
data = data[data['tic'].isin(list_of_stocks)]

In [None]:
data.tic.unique()

array(['AXP', 'CAT', 'CSCO', 'CVX', 'DIS', 'HD', 'IBM', 'JNJ', 'KO',
       'MCD', 'MMM', 'MRK', 'MSFT', 'NKE', 'PFE', 'PG', 'RTX', 'VZ',
       'WMT', 'XOM'], dtype=object)

### 4.3 Add Technical Indicators
---
We define a function that adds technical indicators to the dataset using the ta library.

The following indicators are considered:
* Volatility Average True Range (ATR)
* Volatility Bollinger Band Width (BBW)
* Volume On-balance Volume (OBV)
* Volume Chaikin Money Flow (CMF)
* Trend Moving Average Convergence Divergence (MACD)
* Trend Average Directional Index (ADX)
* Trend Fast Simple Moving Average (SMA)
* Trend Fast Exponential Moving Average (EMA)
* Trend Commodity Channel Index (CCI)
* Momentum Relative Strength Index (RSI)
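
To make a few of these indicators concrete, the sketch below computes a simple moving average, an exponential moving average, and a Bollinger band width by hand with pandas. The window length of 3, the two-standard-deviation bands, and the toy prices are assumptions for illustration; the ta library's default windows and exact conventions may differ.

```python
import numpy as np
import pandas as pd

# Toy close prices (made up)
close = pd.Series([10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 13.0, 15.0])
window = 3  # illustrative window, not the ta default

# Simple moving average: unweighted mean of the last `window` closes
sma = close.rolling(window).mean()

# Exponential moving average: recursive form with alpha = 2 / (window + 1)
ema = close.ewm(span=window, adjust=False).mean()

# Bollinger band width: distance between upper and lower band,
# expressed as a percentage of the middle band
mid = close.rolling(window).mean()
std = close.rolling(window).std(ddof=0)
upper = mid + 2 * std
lower = mid - 2 * std
bbw = (upper - lower) / mid * 100
```

The first `window - 1` values of the rolling quantities are NaN, which is one reason rows are dropped after the indicators are added.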

In [None]:
# Define a Function for adding technical indicators

def add_features(data, feature_list, short_names):
    """
    Add technical indicators as feature columns.
    - Takes a dataset with open, high, low, close and volume columns
    - Also takes a list of the technical indicators to add
      and a matching list of shortened indicator names
    """
    
    # list of column names to filter the features
    data_col_names = list(data.columns)
    filter_names = data_col_names + feature_list
    col_rename = data_col_names +  short_names
    
    # Add technical indicators using the ta library.
    # Note: the frame is passed as is, so rows are not grouped by ticker here.
    data = add_all_ta_features(data, open="open", high="high",
                               low="low", close="close", volume="volume")
    
    # Filter the Indicators with the required features
    data = data[filter_names]
    data.columns = col_rename # rename the columns to use shortened indicator names
    data = data.dropna()
    
    return data
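
Because the frame holds several tickers interleaved by date, any rolling indicator computed over it as a whole appears to mix prices from different stocks. One possible variation is to compute indicators separately per ticker. The sketch below shows the idea with a plain pandas moving average standing in for the ta indicators; the helper name `add_sma_per_ticker` and the toy data are hypothetical.

```python
import pandas as pd

def add_sma_per_ticker(df, window=2):
    """Add a moving average of `close`, computed separately for each ticker."""
    df = df.sort_values(["tic", "date"]).copy()
    # transform keeps the original row alignment while rolling within each group
    df["sma"] = df.groupby("tic")["close"].transform(
        lambda s: s.rolling(window).mean()
    )
    return df.sort_values(["date", "tic"], ignore_index=True)

# Toy data: two tickers with very different price levels (values are made up)
toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"],
    "tic": ["KO", "PG", "KO", "PG"],
    "close": [50.0, 120.0, 52.0, 124.0],
})

result = add_sma_per_ticker(toy)
```

With grouping, the average for KO on the second day uses only KO prices (50 and 52), not the PG row that sits between them in date order.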

In [None]:
# List of Features to add
feature_list= ['volatility_atr','volatility_bbw','volume_obv','volume_cmf',
               'trend_macd', 'trend_adx', 'trend_sma_fast', 
               'trend_ema_fast', 'trend_cci', 'momentum_rsi']

# Short names of the features
short_names = ['atr', 'bbw','obv','cmf','macd', 'adx', 'sma', 'ema', 'cci', 'rsi']

#feature_list= ['volatility_atr','volatility_bbw','volume_obv','volume_cmf','trend_macd']

# Short names of the features
#short_names = ['atr', 'bbw','obv','cmf','macd']

In [None]:
# Add Indicators to our dataset
data_with_features = data.copy()

data_with_features = add_features(data_with_features, feature_list, short_names)



In [None]:
data_with_features.head()

Unnamed: 0,date,tic,close,high,low,open,volume,atr,bbw,obv,cmf,macd,adx,sma,ema,cci,rsi
39,2008-03-20,HD,18.727085,28.17,26.959999,26.969999,22243000.0,24.669007,194.75124,-379331975.0,-28.805267,-0.673785,0.0,28.695749,28.952696,-62.210727,47.082909
40,2008-03-20,IBM,65.684776,113.2696,111.520073,111.940727,11943123.0,31.656357,195.766428,-367388852.0,-28.903208,2.306319,0.0,33.322819,34.603785,193.529282,57.67538
42,2008-03-20,JNJ,40.486263,65.5,64.889999,64.970001,16276300.0,28.570199,195.814974,-383665152.0,-29.862948,2.604738,5.147927,33.185575,35.508782,52.325765,51.695322
44,2008-03-20,KO,18.561644,30.57,30.02,30.08,31028600.0,26.759806,195.534744,-414693752.0,-30.50206,1.059886,4.96755,32.298267,32.90153,-56.035501,47.117691
45,2008-03-20,MCD,34.636936,54.759998,53.700001,53.950001,13075600.0,27.703661,195.477884,-401618152.0,-30.808078,1.119813,4.760681,33.982515,33.168515,19.50627,50.573583


In [None]:
feature_list = list(data_with_features.columns)[7:]

In [None]:
print(feature_list)

['atr', 'bbw', 'obv', 'cmf', 'macd', 'adx', 'sma', 'ema', 'cci', 'rsi']


### 4.4 Add Covariance Matrix
---
We define a function that adds covariance matrices to our dataset.

In [None]:
def add_cov_matrix(df):
    """
    Function to add covariance matrices as part of the defined states
    """
    # Sort the data and index by date and tic
    df = df.sort_values(['date', 'tic'], ignore_index=True)
    df.index = df.date.factorize()[0]

    cov_list = []  # covariance matrix at each time step

    # The lookback window for the covariance matrix is one year (252 trading days)
    lookback = 252
    for i in range(lookback, len(df.index.unique())):
        data_lookback = df.loc[i-lookback:i, :]
        price_lookback = data_lookback.pivot_table(index='date', columns='tic', values='close')
        return_lookback = price_lookback.pct_change().dropna()
        covs = return_lookback.cov().values
        # (optional normalization by covs.max() is disabled)
        cov_list.append(covs)
        
    df_cov = pd.DataFrame({'date':df.date.unique()[lookback:],'cov_list':cov_list})
    df = df.merge(df_cov, on='date')
    df = df.sort_values(['date','tic']).reset_index(drop=True)
    
    return df
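
The core of the function, one covariance matrix of daily returns per lookback window, can be checked on a small synthetic example. This is a sketch under stated assumptions: the three tickers, six dates, random prices, and the shortened lookback of 4 are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic long-format prices: 3 tickers over 6 business days (made up)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=6).strftime("%Y-%m-%d")
tickers = ["KO", "PG", "WMT"]
toy = pd.DataFrame(
    [(d, t, 100 + rng.normal()) for d in dates for t in tickers],
    columns=["date", "tic", "close"],
)

lookback = 4  # shortened from 252 for the toy example

# Pivot to one column per ticker, then take daily returns
wide = toy.pivot_table(index="date", columns="tic", values="close")
returns = wide.pct_change().dropna()

# Covariance of the most recent `lookback` return rows
cov = returns.iloc[-lookback:].cov().values
```

The result is a symmetric matrix with one row and column per ticker. In the function above, the same matrix is attached to every row of a given date, which is why the `cov_list` values repeat across tickers in the output.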

In [None]:
# Add Covariance Matrices to our dataset
data_with_features_covs = data_with_features.copy()
data_with_features_covs = add_cov_matrix(data_with_features_covs)

In [None]:
data_with_features_covs.head()

Unnamed: 0,date,tic,close,high,low,open,volume,atr,bbw,obv,cmf,macd,adx,sma,ema,cci,rsi,cov_list
0,2009-03-20,AXP,9.658469,13.19,12.12,13.19,31088200.0,20.045677,230.002536,-55639230000.0,-19.695017,-0.890151,4.694125,18.653711,19.068885,-76.648004,47.215954,"[[0.0026107181410731633, 0.0012702991646503876..."
1,2009-03-20,CAT,17.987209,28.9,26.73,28.629999,16531300.0,19.965263,230.668541,-55622690000.0,-19.455215,-0.910501,4.391543,17.160566,18.902473,-21.133828,49.482302,"[[0.0026107181410731633, 0.0012702991646503876..."
2,2009-03-20,CSCO,10.789354,16.57,15.75,16.370001,66078200.0,18.192457,230.960632,-55688770000.0,-18.941235,-1.490257,4.276725,16.178579,17.654301,-64.68658,47.580937,"[[0.0026107181410731633, 0.0012702991646503876..."
3,2009-03-20,CVX,35.435562,67.980003,64.269997,67.540001,23811700.0,22.092276,230.081354,-55664960000.0,-18.625545,0.038581,4.799891,17.896791,20.38988,114.564514,54.086535,"[[0.0026107181410731633, 0.0012702991646503876..."
4,2009-03-20,DIS,14.977372,17.98,17.08,17.799999,17766600.0,21.718605,230.298636,-55682730000.0,-18.589908,-0.396044,4.597209,18.082995,19.557186,-54.702746,48.685249,"[[0.0026107181410731633, 0.0012702991646503876..."


### 4.6 Store the Dataframe

In [None]:
df = data_with_features_covs

In [None]:
df.to_csv('df.csv', index=False)
%store df

Stored 'df' (DataFrame)
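
One point to keep in mind when storing this frame: a column that holds NumPy arrays, such as `cov_list`, is written to CSV as text, so reading the file back yields strings rather than arrays. The sketch below contrasts CSV with pickle on a toy frame; the file names and the 2x2 identity matrix are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with one array-valued cell (made up)
toy = pd.DataFrame({"date": ["2020-01-02"], "cov_list": [np.eye(2)]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")

    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# CSV round trip turns the array into its string representation
csv_type = type(from_csv.loc[0, "cov_list"])
# Pickle round trip preserves the array object
pkl_type = type(from_pkl.loc[0, "cov_list"])
```

If the CSV file is only used for inspection and the in-memory `df` is passed on via `%store`, this difference does not matter; it becomes relevant when a later notebook reloads the data from disk.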
