In [None]:
# DEMO CODE 👇


(251, 1013, 93)

#  Forecasting directional movements of stock prices for intraday trading using Random Forest & CuDNNLSTM

-----------------------------------------------------------
*What Is Intraday Return?*

The intraday return is one of the two components of the total daily return generated by a stock. Intraday return measures the return generated by a stock during regular trading hours, based on its price change from the opening of a trading day to its close. Intraday return and overnight return together constitute the total daily return from a stock, which is based on the price change of a stock from the close of one trading day to the close of the next trading day. It is also called daytime return.
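
The decomposition above can be checked with a few lines of arithmetic. This is a minimal sketch with made-up prices (all numbers and variable names are hypothetical); it assumes the usual convention that returns compound multiplicatively, so the overnight and intraday returns rebuild the close-to-close return.

```python
# Hypothetical prices for one stock (illustrative numbers only)
prev_close = 100.0    # close of day t-1
open_price = 102.0    # open of day t
close_price = 105.06  # close of day t

overnight_return = open_price / prev_close - 1   # close(t-1) -> open(t)
intraday_return = close_price / open_price - 1   # open(t)    -> close(t)
daily_return = close_price / prev_close - 1      # close(t-1) -> close(t)

# Returns compound multiplicatively, so the two components rebuild the daily return
recombined = (1 + overnight_return) * (1 + intraday_return) - 1

print(round(overnight_return, 4), round(intraday_return, 4), round(daily_return, 4))
```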

Intraday return is of particular importance for day traders, who use daytime gyrations in stocks and markets to make trading profits, and rarely leave positions open overnight. Day trading strategies are not as commonplace for regular investors as they were before the 2008-2009 recession.

---------------------------------------------------------
*How to Calculate Daily Returns?*

To calculate a daily return, you subtract the starting price from the closing price. Once you have that, you simply multiply by the number of shares you own.

To illustrate, let's say you own `100` shares of XYZ stock. The day opens at `$20` and closes at `$25`. This is a `$5` positive difference. Multiply the `$5` difference by the `100` shares you own, for a daily return of `$500`.

Some investors will prefer to work in percentages rather than dollar amounts. This is only slightly more complicated. You perform the same first step and arrive at a `$5` gain per share for the day. You then divide by the opening price of `$20`, leaving you with `0.25`. Multiply by `100` to arrive at your daily return of `25%`.
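
The same XYZ example, written out as a small sketch (variable names are illustrative). The percentage is taken relative to the `$20` opening price, so the `$5` gain corresponds to `25%`.

```python
shares = 100
open_price = 20.0
close_price = 25.0

gain_per_share = close_price - open_price            # dollar gain per share
dollar_return = gain_per_share * shares               # total dollar return for the position
percent_return = gain_per_share / open_price * 100    # return relative to the opening price

print(dollar_return, percent_return)
```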

-----------------------------------------------------------

This project is done by following the methods and techniques of the paper `Forecasting directional movements of stock prices for intraday trading using LSTM and random forests`. Link to the paper: [Click Here](https://arxiv.org/pdf/2004.10178.pdf).

The paper introduces a multi-feature setting consisting not only of the returns with respect to the closing prices, but also of the returns with respect to the opening prices and the intraday returns, to predict for each stock, at the beginning of each day, the probability of outperforming the market in terms of intraday returns.

As the dataset, we use all stocks of the S&P 500 for the period from January 1990 until December 2018.

We employ both random forests and LSTM networks as training methodologies.

### Technology

- Python: 3.9.16
- Scikit-Learn: 1.2.2
- Tensorflow: 2.12.0
- System RAM: 12.7 GB
- GPU RAM: 15.0 GB
- Disk: 78.2 GB

## Library Imports

In [1]:
# Download the historical price of stocks using yfinance
## Scrape the wikipedia page to get ticker names using BeautifulSoup and requests
from bs4 import BeautifulSoup
import requests
## Download the stock prices using yfinance
import yfinance as yf

# For data processing
import numpy as np
import pandas as pd

# For Random Forest
from sklearn.ensemble import RandomForestRegressor

# LSTM and other layers
import tensorflow as tf

## Download stock data

**RUN THE CELLS BELOW ONCE IF YOU ARE USING THIS IPYNB FILE FOR THE FIRST TIME.
THE CELLS BELOW WILL DOWNLOAD THE DATA IN THE DRIVE FOLDER `datasets/stock-prices-S&P-constituents` as `stocks-data.csv`. So, make sure to create this folder `datasets/stock-prices-S&P-constituents` if not present inside your drive.**


**BUT IF YOU ALREADY HAVE RUN THE BELOW CELLS ONCE THEN NO NEED TO RUN THEM AGAIN!! OTHERWISE, IT WILL AGAIN DOWNLOAD THE DATA WHICH WILL TAKE TIME TO COMPLETE!!**

----------------------------------------------------------

**Download the S&P 500 stocks price data**

The S&P 500 stock market index is maintained by S&P Dow Jones Indices. It comprises 503 common stocks which are issued by 500 large-cap companies traded on American stock exchanges (including the 30 companies that compose the Dow Jones Industrial Average). The index includes about 80 percent of the American equity market by capitalization. It is weighted by free-float market capitalization, so more valuable companies account for relatively more weight in the index. The index constituents and the constituent weights are updated regularly using rules published by S&P Dow Jones Indices. Although called the S&P 500, the index contains 503 stocks because it includes two share classes of stock from 3 of its component companies.

-----------------------------------------------------------
- Start Date: 1989-12-31
- End Date: 2019-01-01

To clear up any confusion: we are actually taking stock price data from 1990-01-01 to 2018-12-31. `yfinance` treats the `end` date as exclusive, so we pass `2019-01-01` to keep 2018-12-31 in the download. The start date is likewise moved one day earlier, which is harmless because 1989-12-31 was not a trading day.

Web scraping code explanation: [Click Here](https://wire.insiderfinance.io/how-to-get-all-stocks-from-the-s-p500-in-python-fbe5f9cb2b61)

Scraped website link: [wikipedia link](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies)

In [2]:
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.text, 'lxml')
table = soup.findAll('table', {'class': 'wikitable sortable'})

tickers = []

for row in table[0].findAll('tr')[1:]:
  ticker = row.findAll('td')[0].text
  tickers.append(ticker)

tickers[0], tickers[1], tickers[-1], len(tickers)

('MMM\n', 'AOS\n', 'ZTS\n', 503)

In [3]:
tickers = [ticker.replace('\n', '') for ticker in tickers]
tickers[0], tickers[1], tickers[-1]

('MMM', 'AOS', 'ZTS')

In [4]:
start_date = "1989-12-31"
end_date = "2019-01-01"

# Download the data
data = yf.download(tickers, start_date, end_date)

[*********************100%***********************]  503 of 503 completed

11 Failed downloads:
- FOX: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- OGN: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- CARR: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- GEHC: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- DOW: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- CTVA: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- CEG: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- BF.B: No data found for this date range, symbol may be delisted
- FOXA: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- OTIS: Data doesn't exist for startDate = 631083600, endDate = 1546318800
- BRK.B: No timezone found, symbol may be delisted


We got the stock price data for 492 stocks.

In [5]:
data

Unnamed: 0_level_0,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,...,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume,Volume
Unnamed: 0_level_1,A,AAL,AAP,AAPL,ABBV,ABC,ABT,ACGL,ACN,ADBE,...,WYNN,XEL,XOM,XRAY,XYL,YUM,ZBH,ZBRA,ZION,ZTS
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1990-01-02,,,,0.264482,,,1.846023,,,1.188340,...,,247200,5326000,18000,,,,,53600,
1990-01-03,,,,0.266257,,,1.852688,,,1.247023,...,,126800,4980400,79200,,,,,111200,
1990-01-04,,,,0.267145,,,1.849356,,,1.305707,...,,204200,6013200,25200,,,,,1600,
1990-01-05,,,,0.268033,,,1.829362,,,1.335048,...,,144800,3854800,92400,,,,,0,
1990-01-08,,,,0.269808,,,1.838740,,,1.352692,...,,189000,4302000,98400,,,,,1600,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-24,60.658112,29.247074,138.501755,35.375175,68.045479,66.140625,60.909615,24.799999,125.596733,205.160004,...,2225000.0,2810600,14262800,1204200,542800.0,1806000.0,959548.0,363000.0,1504800,1551400.0
2018-12-26,63.435986,31.776182,144.184311,37.866348,71.991096,68.271774,64.681633,25.920000,130.614212,222.949997,...,3506200.0,5029800,24887700,2309900,806200.0,2030200.0,1667776.0,327200.0,2969800,1869700.0
2018-12-27,64.345802,31.530161,143.868118,37.620605,72.694504,68.729782,65.619972,26.540001,131.929642,225.139999,...,4229900.0,4759500,22077000,2042600,790800.0,2081600.0,1626267.0,504500.0,2534200,2244700.0
2018-12-28,64.000351,31.323500,144.584229,37.639885,73.672821,69.131706,66.047363,26.389999,131.375290,223.130005,...,2316100.0,5728300,19710600,1763500,782800.0,1699500.0,1915800.0,344800.0,2558600,1797300.0


The raw download is hard to read: it is a wide table with a two-level column index (price field, then ticker). We reshape it into a long format with one row per date and symbol.

In [6]:
df = data.stack().reset_index().rename(index=str, columns={"level_1": "Symbol"}).sort_values(['Symbol','Date'])
df.set_index('Date', inplace=True)
df

Unnamed: 0_level_0,Symbol,Adj Close,Close,High,Low,Open,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1999-11-18,A,26.845926,31.473534,35.765381,28.612303,32.546494,62546380.0
1999-11-19,A,24.634192,28.880545,30.758226,28.478184,30.713518,15234146.0
1999-11-22,A,26.845926,31.473534,31.473534,28.657009,29.551144,6577870.0
1999-11-23,A,24.405386,28.612303,31.205294,28.612303,30.400572,5975611.0
1999-11-24,A,25.053659,29.372318,29.998213,28.612303,28.701717,4843231.0
...,...,...,...,...,...,...,...
2018-12-24,ZTS,77.151314,79.279999,80.910004,78.900002,80.910004,1551400.0
2018-12-26,ZTS,80.693573,82.919998,82.940002,79.139999,79.610001,1869700.0
2018-12-27,ZTS,82.065720,84.330002,84.330002,81.180000,81.830002,2244700.0
2018-12-28,ZTS,82.221428,84.489998,85.589996,83.550003,84.830002,1797300.0


In [7]:
df.to_csv('../datasets/stock-prices-S&P-constituents/stocks-data.csv')

## Steps to follow - 
1. We divide our raw data into study periods, where each study period is divided into a training part (for in-sample trading) and a trading part (for out-of-sample predictions).
2. We introduce our features.
3. We set up our targets.
4. We define the 2 machine learning methods we employ, namely random forest and CuDNNLSTM.
5. We establish a trading strategy for the trading part.

## Data preparation for Model


### Preparing the original dataset for further processing

*Segregate the stocks within different numpy arrays according to the ticker name.*
![Stack-stock-data-on-top-of-each-other.png](https://i.postimg.cc/HnDyJp9s/Stack-stock-data-on-top-of-each-other.png)

WARNING❗ 
- Different stocks have different numbers of rows, because not all stocks were listed for the whole time span.
- As a result, we do not have values for every trading day for every stock.

To prevent shape errors, we remove the stocks that have missing values for any of those dates.

E.g. if stock A has data from 1990-01-02 but stock B only from 1991-01-01, the two arrays do not have the same shape, so we remove stock B. This makes sure all remaining stocks have the same number of rows.
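
The filtering idea can be sketched on a toy long-format table. This is only an illustration with hypothetical tickers (`AAA`, `BBB`) and prices, using a `groupby` row count instead of the loop used in the notebook; it keeps only the symbols that have a row for every date.

```python
import pandas as pd

# Toy long-format frame (hypothetical tickers and prices)
toy = pd.DataFrame({
    "Date": ["1990-01-02", "1990-01-03", "1990-01-04", "1990-01-03", "1990-01-04"],
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Open": [1.0, 1.1, 1.2, 5.0, 5.1],
})

expected_rows = toy["Date"].nunique()            # number of trading days in the toy sample
rows_per_symbol = toy.groupby("Symbol").size()   # row count for each ticker

# Keep only the tickers that cover every trading day
complete = rows_per_symbol[rows_per_symbol == expected_rows].index.tolist()
filtered = toy[toy["Symbol"].isin(complete)]

print(complete)
```

Here `BBB` is dropped because it is missing the first trading day.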

In [8]:
# Retrieve the data from your drive
df = pd.read_csv('../datasets/stock-prices-S&P-constituents/stocks-data.csv')
df = df.set_index('Date')

In [9]:
# ticker of each stock
symbols = df['Symbol'].unique()

n_stocks = len(symbols) # number of stocks
n_rows = 7307 # no of stock days from 1990-01-01 to 2018-12-31, calculated 252*29-1=7307

stocks = [] # Store the stocks data inside stocks list
filtered_symbols = [] # These are the symbols of the stocks that have exactly 7307 rows

# Segregate the stocks into different numpy arrays according to the ticker name
for symbol in symbols:
    # reset_index() puts Date at column 0 and Symbol at column 1; drop the Symbol column
    stock_rows = np.delete(df[df['Symbol'] == symbol].reset_index().to_numpy(), 1, axis=1)
    # Keep only the stocks that were available from 1990-01-01 to 2018-12-31
    if stock_rows.shape[0] == n_rows:
        stocks.append(stock_rows)
        filtered_symbols.append(symbol)

In [10]:
# Create a check point
final_stocks = stocks

*The total number of stocks is now only 251, but all 251 stocks are guaranteed to have the same number of rows.*

In [11]:
len(final_stocks), len(filtered_symbols), type(final_stocks)

(251, 251, list)

In [12]:
# Convert the list into array for smooth manipulation of data later
final_stocks = np.array(final_stocks)
type(final_stocks), final_stocks.shape

(numpy.ndarray, (251, 7307, 7))

We only need the opening price and the adjusted closing price, so we can remove the other columns: Close, High, Low and Volume.

In [13]:
df[df['Symbol']=='AAPL']

Unnamed: 0_level_0,Symbol,Adj Close,Close,High,Low,Open,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1990-01-02,AAPL,0.264482,0.332589,0.334821,0.312500,0.314732,183198400.0
1990-01-03,AAPL,0.266257,0.334821,0.339286,0.334821,0.339286,207995200.0
1990-01-04,AAPL,0.267145,0.335938,0.345982,0.332589,0.341518,221513600.0
1990-01-05,AAPL,0.268033,0.337054,0.341518,0.330357,0.337054,123312000.0
1990-01-08,AAPL,0.269808,0.339286,0.339286,0.330357,0.334821,101572800.0
...,...,...,...,...,...,...,...
2018-12-24,AAPL,35.375175,36.707500,37.887501,36.647499,37.037498,148676800.0
2018-12-26,AAPL,37.866348,39.292500,39.307499,36.680000,37.075001,234330000.0
2018-12-27,AAPL,37.620605,39.037498,39.192501,37.517502,38.959999,212468400.0
2018-12-28,AAPL,37.639885,39.057499,39.630001,38.637501,39.375000,169165600.0


In [14]:
final_stocks[0, 0, :]

array(['1990-01-02', 0.264482170343399, 0.3325890004634857,
       0.3348209857940674, 0.3125, 0.3147319853305816, 183198400.0],
      dtype=object)

In [15]:
final_stocks[0, 0, 2:4], final_stocks[0, 0, 6]

(array([0.3325890004634857, 0.3348209857940674], dtype=object), 183198400.0)

In [16]:
final_stocks = np.delete(final_stocks, np.s_[2:5], axis=2) # Delete Close, High, Low columns 
final_stocks = np.delete(final_stocks, 3, axis=2) # Delete Volume column, NOTE after first deletion index is changed
final_stocks.shape # Now it only contains Date, Adj Close and Open

(251, 7307, 3)

In [17]:
final_stocks[:, :, 0] = np.array([pd.to_datetime(stock_i).date for stock_i in final_stocks[:, :, 0]])

In [18]:
final_stocks[0, 0, 0]

datetime.date(1990, 1, 2)

### Datasets creation with non-overlapping testing period from original dataset

We divide the dataset, consisting of 29 years from January 1990 to December 2018, into study periods using a 4-year window and a 1-year stride. Each study period is divided into a training part (approximately 756 days, about 3 years) and a trading part (approximately 252 days, about 1 year).

So, we obtain 26 study periods with non-overlapping trading parts.
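
The count of 26 follows directly from the window arithmetic. A minimal sketch in calendar years only (the variable names are illustrative, and actual trading-day boundaries are ignored here):

```python
first_year, last_year = 1990, 2018
window_size, stride = 4, 1   # both in years

study_periods = []
start = first_year
while start + window_size - 1 <= last_year:
    train_years = (start, start + window_size - 2)   # first 3 years: training part
    trade_year = start + window_size - 1             # 4th year: trading part
    study_periods.append((train_years, trade_year))
    start += stride

trade_years = [trade_year for _, trade_year in study_periods]
print(len(study_periods), study_periods[0], study_periods[-1])
```

The windows start in 1990 through 2015, and each trading year from 1993 to 2018 appears exactly once, which is what makes the trading parts non-overlapping.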

![Dataset-creation-with-non-overlapping-testing-period.png](https://i.postimg.cc/7YWw5YwL/Dataset-creation-with-non-overlapping-testing-period.png)

**METHOD TO CREATE THE NON-OVERLAPPING TESTING PERIODS**

1. Store the dates inside temp variable.

  ![dates-layed-out-in-stock-price-prediction.png](https://i.postimg.cc/65MCHfGk/dates-layed-out-in-stock-price-prediction.png)

2. Define 2 variables: `year_start`, which points to the starting day of each dataset, and `start_index`, which holds the exact index of that starting day.
  ![year-start-start-index-variables.png](https://i.postimg.cc/MGvP8krx/year-start-start-index-variables.png)

3. `year_start` will go up to `2015`, as 2015-2018 is the last study period.

4. Another variable called `year_end` marks the end of the window. To be precise, it is not the last day of the final year but an exclusive upper bound just past the end of the window, i.e. if `year_start = '1990-01-02'` then `year_end = '1994-01-02'`.

  Why does this make sense?

  We want the exact index of the first and last day of each dataset, and then select the rows between those indexes with conditional indexing. Defining `year_end` this way takes the least effort: we only add `window_size` to the year of `year_start`. In the condition we then exclude `year_end` itself.
  
  ![year-end.png](https://i.postimg.cc/XY9CjPWR/year-end.png)

5. Index the `temp` that contained the whole 29 years time from `year_start`(including) to `year_end`(excluding). Condition will be `temp[(temp>=year_start) & (temp<year_end)]`. This is the `timeline` of the current dataset(not yet created!).
  ![timeline-creation.png](https://i.postimg.cc/JzrDkWfL/timeline-creation.png)

6. Calculate the `end_index` using the length of the current dataset's timeline and `start_index`. 
  
  `end_index = start_index + len(timeline)`
  ![progress-of-end-index.png](https://i.postimg.cc/T2rzsZHB/progress-of-end-index.png)

7. Slice the data from `start_index` to `end_index` (exclusive). The date is not needed in the later steps, so drop it by slicing the last axis from `index=1` onwards, since the date sits at `index=0`.

  `data[:, start_index:end_index, 1:]`
  
  Then append it to `datasets` list.

8. Finally, update `year_start` by moving it forward by `stride=1` year. If `year_start` is `1990-01-02`, then next time it will be `1991-01-02`.
![progress-year-start.png](https://i.postimg.cc/HkwvsY30/progress-year-start.png)

9. Now, as we have the new `year_start`, we will use it to find the `before_start_timeline` that has passed from `1990-01-02`. This timeline will store the days from `1990-01-02` to the current `year_start` date.

  It will help us find the new `start_index`. We need to just find the length of this `before_start_timeline` or in other words how many days have passed from `1990-01-02` till the current `year_start`.

![progress-of-start-index.png](https://i.postimg.cc/dtcKLdWy/progress-of-start-index.png)
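
The indexing idea behind steps 5 to 7 can be sanity-checked on a toy calendar. This is a sketch under stated assumptions: the dates are hypothetical, the window is 1 year instead of 4, and `start_index` is computed as the number of days strictly before `year_start`, which is one way to make the slice line up with `timeline`.

```python
import datetime as dt
import numpy as np

# Toy calendar of trading days (hypothetical; weekends and holidays already removed)
temp = np.array([dt.date(1990, 12, 28), dt.date(1990, 12, 31),
                 dt.date(1991, 1, 2), dt.date(1991, 1, 3),
                 dt.date(1991, 12, 31), dt.date(1992, 1, 2)], dtype=object)

year_start = dt.date(1991, 1, 2)
year_end = year_start.replace(year=year_start.year + 1)   # 1-year window for the toy case

# Step 5: keep year_start (inclusive) up to year_end (exclusive)
timeline = temp[(temp >= year_start) & (temp < year_end)]

# Index of the first day on or after year_start = number of days strictly before it
start_index = int(np.sum(temp < year_start))
end_index = start_index + len(timeline)

# The positional slice should reproduce the date-based selection
print(list(temp[start_index:end_index]) == list(timeline))
```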

In [19]:
def dataset_generator(data, window_size=4, stride=1):
    '''
    data: stocks data containing date from 1990 to 2018 -> dims = (_, _, _)
    window_size: no of years contained inside any dataset -> int
    stride: by how much amount the window should slide -> int

    returns list of datasets each having 'window_size'ed years of stock data -> list
    '''

    # datasets -> [D1, D2, D3, ..., D26], Di will be a (251, 4 year time length, 2)
    datasets = []

    # Step 1
    temp = data[0, :, 0]

    # Step 2
    year_start = data[0, 0, 0]
    start_index = 0

    # Step 3
    while year_start.year<=2015:

        # Step 4
        year_end = year_start.replace(year=year_start.year+window_size)

        # Step 5
        timeline = temp[(temp>=year_start) & (temp<year_end)]

        # Step 6
        end_index = start_index + timeline.shape[0]

        # Step 7
        datasets.append(data[:, start_index:end_index, 1:])

        # Step 8
        year_start = year_start.replace(year=year_start.year+stride)

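        # Step 9
        # Count only the days strictly before year_start, so that start_index is the
        # position of the first day of the Step 5 timeline (temp >= year_start)
        before_start_timeline = temp[temp<year_start]
        start_index = len(before_start_timeline)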

    return datasets

In [20]:
datasets = dataset_generator(final_stocks)

In [21]:
len(datasets)

26

In [22]:
datasets[0].shape

(251, 1013, 2)

In [23]:
datasets[0][0, 0, :]

array([0.264482170343399, 0.3147319853305816], dtype=object)

Order of values for the above result : *Adj Close, Open*

In [24]:
main_datasets = datasets # Create checkpoint

## Features Selection

Let $T_{study}$ denote the total number of days in a study period and $n_i$ the number of stocks $s$ in $S$ having complete historical data available at the end of study period $i$. Moreover, we denote the adjusted closing price and the opening price of any stock $s \in S$ at time $t$ by $cp^{(s)}_t$ and $op^{(s)}_t$.

Given a prediction day $t:=\tau$, we have the following inputs and prediction task.

Input: We have the historical opening prices, $op^{(s)}_t, t \in \{ 0, 1, ..., \tau -1, \tau\}$, (including the opening price of the prediction day $op^{(s)}_\tau$) as well as the historical adjusted closing prices, $cp^{(s)}_t, t \in \{ 0, 1, ..., \tau -1\}$, (excluding the closing price of the prediction day $cp^{(s)}_\tau$).

Task: Out of all n stocks, predict k stocks with highest and k stocks with lowest intraday return $ir_{\tau, 0} = \dfrac{cp_\tau}{op_\tau} - 1$.
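
To make the ranking target concrete, here is a small sketch that computes $ir_{\tau, 0}$ in hindsight for six hypothetical stocks and picks the top and bottom $k$. The prices and the value of `k` are made up; in the actual task the closing prices of day $\tau$ are not known at prediction time, so the models have to estimate this ranking in advance.

```python
import numpy as np

# Hypothetical open and adjusted close prices of n = 6 stocks on the prediction day
op_tau = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
cp_tau = np.array([10.5, 19.0, 30.3, 40.0, 52.0, 58.2])

ir_tau = cp_tau / op_tau - 1      # intraday return of each stock

k = 2
order = np.argsort(ir_tau)        # stock indices sorted by ascending intraday return
flop_k = order[:k]                # k stocks with the lowest intraday return
top_k = order[-k:][::-1]          # k stocks with the highest intraday return, best first

print(top_k, flop_k)
```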

**NOTE:** In the original paper, the authors used all the stocks that could be scraped from the web and then divided each stock into 26 datasets. Among these 26 datasets, some may contain all 492 originally scraped stocks and some may contain only 251. That is why the notation says $s \in S$: each dataset can have a different number of stocks, which is a subset of all the originally scraped stocks.

In our case, however, we only deal with the stocks that have all entries filled from 1990-01-02 to 2018-12-31. So, we have 251 stocks in all the datasets.



For LaTex markdown, refer to this page: [here](https://ashki23.github.io/markdown-latex.html)

### Feature generation for Random Forest

For any stock $s \in S$ and any time $t \in \{ 241, 242, ..., T_{study} \}$, the feature set we provide to the random forest comprises the following three signals:

1. Intraday return: $ir^{(s)}_{t, m} := \dfrac{cp^{(s)}_{t-m}}{op^{(s)}_{t-m}} - 1$,


2. Returns with respect to last closing price: $cr^{(s)}_{t, m} := \dfrac{cp^{(s)}_{t-1}}{cp^{(s)}_{t-1-m}} - 1$,


3. Returns with respect to opening price: $or^{(s)}_{t, m} := \dfrac{op^{(s)}_{t}}{cp^{(s)}_{t-m}} - 1$,

where $m \in \{ 1, 2, 3, ..., 20 \} \cup \{ 40, 60, 80, ..., 240 \}$, i.e. 31 values of $m$ for each of the three signals, giving 93 features. By this choice of $m$ we consider, in the first month, the corresponding returns of each trading day, whereas for the subsequent 11 months we only consider the corresponding multi-period returns of each month.
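
The feature count and the three formulas can be checked on a single synthetic stock. This is a minimal sketch, not the notebook's vectorized implementation: the price paths are randomly generated, `t` is a 0-based day index, and `or_` is just a hypothetical name chosen because `or` is a Python keyword.

```python
import numpy as np

# m = {1, ..., 20} U {40, 60, ..., 240}: 20 daily lags plus 11 monthly lags
M = np.concatenate((np.arange(1, 21), np.arange(40, 241, 20)))
n_features = 3 * len(M)

# Hypothetical price paths for one stock
rng = np.random.default_rng(0)
cp = 100 + np.cumsum(rng.normal(0, 1, 300))   # adjusted close
op = cp + rng.normal(0, 0.5, 300)             # open

t = 241  # first day with a full history, since cr needs cp[t-1-240] = cp[0]
ir = cp[t - M] / op[t - M] - 1        # intraday returns
cr = cp[t - 1] / cp[t - 1 - M] - 1    # returns w.r.t. the last closing price
or_ = op[t] / cp[t - M] - 1           # returns w.r.t. the opening price

features = np.concatenate((ir, cr, or_))
print(len(M), n_features, features.shape)
```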

In [25]:
len(main_datasets)

26

In [28]:
# This function generates the 93 Random Forest features for one study period
def generate_features_rf(curr_dataset):

    # Total number of days in the current study period
    T_study = curr_dataset.shape[1]
    print("current dataset has", T_study, " days.")

    # 0-based day indices 240, 241, ..., T_study-1 (i.e. from the 241st day onwards)
    t = np.arange(240, T_study)

    # m = [1, 2, 3, ..., 20] U [40, 60, 80, ..., 240]
    M = np.concatenate((np.arange(1, 21), np.arange(40, 241, 20)))

    # Number of stocks, used to create arrays with proper shapes
    n_stocks = curr_dataset.shape[0]

    # Create a container to store ir, cr and or for the current dataset
    container = np.ones(shape=(n_stocks, T_study, M.shape[0]*3))

    # The first 240 rows only serve as history for feature creation, so fill them with NaN
    container[:, :t[0], :] = np.nan

    # Prices as float arrays: column 0 = Adj Close (cp), column 1 = Open (op)
    cp = curr_dataset[:, :, 0].astype(float)
    op = curr_dataset[:, :, 1].astype(float)

    # cp_(t-m), op_(t-m) and cp_(t-1-m) for each m, stored at index i along axis=2
    cp_t_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))
    op_t_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))
    cp_t_1_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))
    for i, m in enumerate(M):
        cp_t_m[:, :, i] = cp[:, t-m]
        op_t_m[:, :, i] = op[:, t-m]
        # NOTE: for t=240 and m=240 the index t-1-m is -1 and wraps around to the last day;
        # start t at 241 if that single value matters
        cp_t_1_m[:, :, i] = cp[:, t-1-m]

    # cp_(t-1) and op_t; t already holds 0-based indices, so no extra shift is needed
    cp_t_1 = cp[:, t-1]
    op_t = op[:, t]

    # ir_(t,m) = cp_(t-m) / op_(t-m) - 1
    ir_t_m = np.divide(cp_t_m, op_t_m, out=np.zeros_like(cp_t_m), where=op_t_m!=0) - 1

    # cr_(t,m) = cp_(t-1) / cp_(t-1-m) - 1; the dividend (n_stocks, days) gets a trailing
    # axis so that it broadcasts against the (n_stocks, days, len(M)) divisor
    cr_t_m = np.divide(cp_t_1[:, :, None], cp_t_1_m, out=np.zeros_like(cp_t_1_m), where=cp_t_1_m!=0) - 1

    # or_(t,m) = op_t / cp_(t-m) - 1, broadcast the same way
    or_t_m = np.divide(op_t[:, :, None], cp_t_m, out=np.zeros_like(cp_t_m), where=cp_t_m!=0) - 1

    # Put ir, cr and or side by side inside the container
    container[:, t, :] = np.dstack((ir_t_m, cr_t_m, or_t_m))

    return container

In [29]:
# It will contain all the newly processed datasets each with a shape (251, stock days in 4 years, 93)
containers = []

# Run the generate_features_rf function for each dataset inside main_datasets
for dataset in main_datasets:
    containers.append(generate_features_rf(dataset))

current dataset has 1013  days.
current dataset has 1012  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1009  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1008  days.
current dataset has 1007  days.
current dataset has 1006  days.
current dataset has 1007  days.
current dataset has 1007  days.
current dataset has 1008  days.
current dataset has 1009  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1008  days.
current dataset has 1007  days.
current dataset has 1005  days.


In [30]:
len(containers)

26

### Feature generation for LSTM

We feed the model sequences of 240 timesteps with 3 features each and train it to predict the direction of the $241^{st}$ intraday return.

More precisely, for each stock $s$ at time $t$, we first consider the following three features $ir^{(s)}_{t, 1}, cr^{(s)}_{t, 1}, or^{(s)}_{t, 1}$ defined above.

Then we apply the Robust Scaler Standardization

$\tilde f^{(s)}_{t, 1} := \dfrac {f^{(s)}_{t,1} - Q_2(f^{(s)}_{.,1})} {Q_3(f^{(s)}_{.,1}) - Q_1(f^{(s)}_{.,1})}$

where $Q_1(f^{(s)}_{.,1}), Q_2(f^{(s)}_{.,1})$ and $Q_3(f^{(s)}_{.,1})$ are the first, second and third quartile of $f^{(s)}_{.,1}$, for each feature $f^{(s)}_{.,1} \in \{ ir^{(s)}_{., 1}, cr^{(s)}_{., 1}, or^{(s)}_{., 1} \}$ in the respective training period.

The Robust Scaler Standardization first subtracts (and hence removes) the median and then scales the data using the inter-quartile range, making it robust to outliers.
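
A minimal NumPy sketch of this standardization for a single feature of a single stock. The return series are synthetic and the names (`train_feature`, `trade_feature`, `robust_scale`) are hypothetical; the point is that the quartiles come from the training part only and are then reused for the trading part.

```python
import numpy as np

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 0.02, size=756)   # hypothetical returns, training part
trade_feature = rng.normal(0.0, 0.02, size=252)   # hypothetical returns, trading part

# Quartiles are estimated on the training part only
q1, q2, q3 = np.percentile(train_feature, [25, 50, 75])

def robust_scale(f):
    # Subtract the median, then divide by the inter-quartile range
    return (f - q2) / (q3 - q1)

train_scaled = robust_scale(train_feature)
trade_scaled = robust_scale(trade_feature)   # same q1, q2, q3 as the training part

print(np.median(train_scaled))
```

After scaling, the training feature has a median of zero and an inter-quartile range of one, while the trading feature is only approximately centered because it reuses the training statistics.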

Next, for each time $t \in \{ 240, 241, ..., T_{study} \}$, we generate overlapping sequences of 240 consecutive, three-dimensional standardized features $\{ \tilde F^{(s)}_{t-239,1}, \tilde F^{(s)}_{t-238,1}, ..., \tilde F^{(s)}_{t,1} \}$, where $\tilde F^{(s)}_{t-i,1} := (\tilde ir^{(s)}_{t-i,1}, \tilde cr^{(s)}_{t-i,1}, \tilde or^{(s)}_{t-i,1}), i \in \{ 239, 238, ..., 0 \}$.
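
One possible way to build such overlapping sequences for a single stock is NumPy's `sliding_window_view`. This is a sketch on random data, not necessarily how the rest of the notebook does it; `scaled` stands in for the standardized $(\tilde{ir}, \tilde{cr}, \tilde{or})$ series and all sizes are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T_study, n_features, seq_len = 1000, 3, 240
rng = np.random.default_rng(0)
scaled = rng.normal(size=(T_study, n_features))   # hypothetical standardized features for one stock

# Windows over the time axis; the window dimension is appended last: (N, 3, 240)
windows = sliding_window_view(scaled, window_shape=seq_len, axis=0)

# Reorder to (num_sequences, timesteps, features), the layout an LSTM usually expects
sequences = windows.transpose(0, 2, 1)

print(sequences.shape)
```

Each `sequences[j]` covers days `j` to `j + 239`, so consecutive sequences overlap in 239 days.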