We feed the model sequences of 240 timesteps with 3 features each and train it to predict the direction of the $241^{st}$ intraday return.

More precisely, for each stock $s$ at time $t$, we first consider the following three features $ir^{(s)}_{t, 1}, cr^{(s)}_{t, 1}, or^{(s)}_{t, 1}$ defined above.

Then we apply the Robust Scaler Standardization

$\tilde f^{(s)}_{t, 1} := \dfrac {f^{(s)}_{t,1} - Q_2(f^{(s)}_{.,1})} {Q_3(f^{(s)}_{.,1}) - Q_1(f^{(s)}_{.,1})}$

where $Q_1(f^{(s)}_{.,1}), Q_2(f^{(s)}_{.,1})$ and $Q_3(f^{(s)}_{.,1})$ are the first, second and third quartile of $f^{(s)}_{.,1}$, for each feature $f^{(s)}_{.,1} \in \{ ir^{(s)}_{., 1}, cr^{(s)}_{., 1}, or^{(s)}_{., 1} \}$ in the respective training period.

The Robust Scaler Standardization first subtracts (and hence removes) the median and then scales the data using the inter-quartile range, making it robust to outliers.

Next, for each time $t \in \{ 240, 241, ..., T_{study} \}$, we generate overlapping sequences of 240 consecutive, three-dimensional standardized features $\{ \tilde F^{(s)}_{t-239,1}, \tilde F^{(s)}_{t-238,1}, ..., \tilde F^{(s)}_{t,1} \}$, where $\tilde F^{(s)}_{t-i,1} := (\tilde ir^{(s)}_{t-i,1}, \tilde cr^{(s)}_{t-i,1}, \tilde or^{(s)}_{t-i,1}), i \in \{ 239, 238, ..., 0 \}$.

In [None]:
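def robust_scaler(f_t, f):
    # Quartiles are computed on f, the feature values of the training period.
    # `method=` replaces the deprecated `interpolation=` keyword (NumPy >= 1.22).
    quartile_25 = np.percentile(f, 25, method="midpoint")
    quartile_50 = np.median(f)
    quartile_75 = np.percentile(f, 75, method="midpoint")
    
    inter_qrt_range = quartile_75 - quartile_25
    
    # Scale the value(s) f_t, not the reference sample f
    return (f_t - quartile_50) / inter_qrt_range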

In [33]:
# DEMO CODE 👇
def func():
    curr_dataset = main_datasets[0]
    
    # Take the total amount of days in 1st study period
    T_study = curr_dataset.shape[1]

    # Create t = [240, 241, ..., T_study - 1] (0-based day indices)
    t = np.arange(240, T_study)
    I = np.arange(239, -1, -1)
    
    # Define number of stocks as it will be used to create arrays with proper shapes
    n_stocks = 251

    # We use 240 consecutive 3 dimensional vectors to predict 241st return.
    # Create a container to store ir, cr and or for the current dataset. Shape (251, 1008, 240, 3). F -> (240, 3)
    container = np.ones(shape=(n_stocks, T_study, t[0], 3))

    # ir_t-i,1 |  cr_t-i,1 | or_t-i,1
    
    

    return container.shape


func()

numpy.ndarray

#  Forecasting directional movements of stock prices for intraday trading using Random Forest & CuDNNLSTM

-----------------------------------------------------------
*What Is Intraday Return?*

The intraday return is one of the two components of the total daily return generated by a stock. Intraday return measures the return generated by a stock during regular trading hours, based on its price change from the opening of a trading day to its close. Intraday return and overnight return together constitute the total daily return from a stock, which is based on the price change of a stock from the close of one trading day to the close of the next trading day. It is also called daytime return.

Intraday return is of particular importance for day traders, who use daytime gyrations in stocks and markets to make trading profits, and rarely leave positions open overnight. Day trading strategies are not as commonplace for regular investors as they were before the 2008-2009 recession.

---------------------------------------------------------
*How to Calculate Daily Returns?*

To calculate a daily return, you subtract the starting price from the closing price. Once you have that, you simply multiply by the number of shares you own.

To illustrate, let's say you own `100` shares of XYZ stock. The day opens at `$20` and closes at `$25`. This is a `$5` positive difference. Multiply the `$5` difference by the `100` shares you own, for a daily return of `$500`.

Some investors will prefer to work in percentages rather than dollar amounts. This is only slightly more complicated. You perform the same first step and arrive at a `$5` gain per share for the day. You then divide by the opening price of `$20`, leaving you with `0.25`. Multiply by `100` to arrive at your daily return of `25%`.
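As a quick check of this arithmetic, here is a minimal sketch for the XYZ example (open `$20`, close `$25`, `100` shares). The function names `dollar_return` and `percent_return` are illustrative only and are not used elsewhere in this notebook.

```python
def dollar_return(open_price, close_price, shares):
    """Dollar gain or loss over one trading day for a position of `shares`."""
    return (close_price - open_price) * shares


def percent_return(open_price, close_price):
    """Return over the day in percent, measured relative to the opening price."""
    return (close_price / open_price - 1) * 100


# XYZ example: opens at $20, closes at $25, 100 shares held
xyz_dollars = dollar_return(20.0, 25.0, 100)
xyz_percent = percent_return(20.0, 25.0)
print(xyz_dollars, xyz_percent)
```

Dividing by the opening price (not the closing price) is what makes this consistent with the intraday return definition used later in the notebook.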

-----------------------------------------------------------

This project is done by following the methods and techniques of the paper `Forecasting directional movements of stock prices for intraday trading using LSTM and random forests`. Link to the paper: [Click Here](https://arxiv.org/pdf/2004.10178.pdf).

The paper introduces a multi-feature setting consisting not only of the returns with respect to the closing prices, but also of the returns with respect to the opening prices and the intraday returns, to predict for each stock, at the beginning of each day, the probability of outperforming the market in terms of intraday returns.

As our dataset, we use all stocks of the S&P 500 from January 1990 until December 2018.

We employ both random forests and LSTM networks as training methodologies.

### Technology

- Python: 3.9.16
- Scikit-Learn: 1.2.2
- Tensorflow: 2.12.0
- System RAM: 12.7 GB
- GPU RAM: 15.0 GB
- Disk: 78.2 GB

## Library Imports

In [1]:
# For data processing
import numpy as np
import pandas as pd

# Load pickle file
import pickle

# For Random Forest
from sklearn.ensemble import RandomForestRegressor

# LSTM and other layers
import tensorflow as tf

## Download the stocks and save it

Refer to `Collect Stock Data and Store it.ipynb` file to download the stocks data and save it.

Then come back to this file. If you have already downloaded the data by running that notebook once, there is no need to run it again.

## Steps to follow - 
1. We divide our raw data into study periods, where each study period is divided into a training part (for in-sample trading) and a trading part (for out-of-sample predictions).
2. We introduce our features.
3. We set up our targets.
4. We define the two machine learning methods we employ, namely random forest and CuDNNLSTM.
5. We establish a trading strategy for the trading part.

## Data preparation for Model


### Preparing the original dataset for further processing

*Segregate the stocks within different numpy arrays according to the ticker name.*
![Stack-stock-data-on-top-of-each-other.png](https://i.postimg.cc/HnDyJp9s/Stack-stock-data-on-top-of-each-other.png)

WARNING❗ 
- Different stocks yield different numbers of rows, because not every stock was available over the whole time span.
- As a result, we cannot get values for every day for every stock.

To prevent shape errors, we remove the stocks that have empty values for any of those dates.

E.g. Stock A might start on 1991-01-01 and Stock B on 1990-12-31, so the two do not have the same shape. We can then remove stock B, making sure all remaining stocks have the same number of rows.
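The filtering and stacking step can be sketched as follows. This is a toy illustration, not the code from `Generating List of Datasets and Saving it.ipynb`: the function `stack_complete_stocks`, the nested-dict input layout and the `(close, open)` column order are all assumptions made for the example.

```python
import numpy as np


def stack_complete_stocks(prices_by_ticker, trading_days):
    """Keep only tickers that have a row for every trading day, then stack them.

    prices_by_ticker: {ticker: {date_string: (close, open)}}  (assumed layout)
    Returns the kept tickers and an array of shape (n_stocks, n_days, 2).
    """
    kept, rows = [], []
    for ticker, by_date in prices_by_ticker.items():
        if all(day in by_date for day in trading_days):
            kept.append(ticker)
            rows.append([by_date[day] for day in trading_days])
    return kept, np.array(rows, dtype=float)


trading_days = ["1990-01-02", "1990-01-03", "1990-01-04"]
toy_prices = {
    "AAA": {"1990-01-02": (10.0, 9.8), "1990-01-03": (10.2, 10.0), "1990-01-04": (10.1, 10.2)},
    # BBB has no row for the first day, so it is dropped
    "BBB": {"1990-01-03": (50.0, 49.0), "1990-01-04": (51.0, 50.0)},
}

kept_tickers, stacked = stack_complete_stocks(toy_prices, trading_days)
print(kept_tickers, stacked.shape)
```

After this step every remaining stock has the same number of rows, so the per-stock arrays can be stacked into one rectangular array.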

### Datasets creation with non-overlapping testing period from original dataset

We divide the dataset, which consists of 29 years from January 1990 to December 2018, using a 4-year window and a 1-year stride. Each study period is divided into a training part (756 days, approximately 3 years) and a trading part (252 days, approximately 1 year).

So, we obtain 26 study periods with non-overlapping trading part.
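The count of 26 follows from sliding a 4-year window over 29 years with a stride of one year: $29 - 4 + 1 = 26$. Below is a minimal sketch that enumerates the windows by calendar year. Splitting on calendar years is a simplification assumed for the example (the actual split is made on trading-day counts), and `study_periods` is a hypothetical helper.

```python
def study_periods(first_year=1990, last_year=2018, window=4, stride=1):
    """Enumerate (training_years, trading_year) pairs for a rolling window."""
    periods = []
    start = first_year
    while start + window - 1 <= last_year:
        training_years = list(range(start, start + window - 1))  # first 3 years
        trading_year = start + window - 1                        # last year
        periods.append((training_years, trading_year))
        start += stride
    return periods


periods = study_periods()
trading_years = [trading_year for _, trading_year in periods]
print(len(periods), periods[0], periods[-1])
```

Because the stride equals the length of the trading part, each year from 1993 to 2018 is used as a trading year exactly once.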

![Dataset-creation-with-non-overlapping-testing-period.png](https://i.postimg.cc/7YWw5YwL/Dataset-creation-with-non-overlapping-testing-period.png)

Refer to `Generating List of Datasets and Saving it.ipynb` file for the steps of `Preparing the original dataset for further processing` and `Datasets creation with non-overlapping testing period from original dataset`.

After running all the cells, come back to this file. You are then ready to run the cells below, which load the `datasets-list` pickle file saved by the `Generating List of Datasets and Saving it.ipynb` file.

## Features Selection

Let $T_{study}$ denote the total number of days in a study period and $n_i$ the number of stocks $s$ in $S$ having complete historical data available at the end of study period $i$. Moreover, we denote the adjusted closing price and opening price of any stock $s \in S$ at time $t$ by $cp^{(s)}_t$ and $op^{(s)}_t$.

Given a prediction day $t:=\tau$, we have the following inputs and prediction task.

Input: We have the historical opening prices, $op^{(s)}_t, t \in \{ 0, 1, ..., \tau -1, \tau\}$, (including the opening price of the prediction day $op^{(s)}_\tau$) as well as the historical adjusted closing prices, $cp^{(s)}_t, t \in \{ 0, 1, ..., \tau -1\}$, (excluding the closing price of the prediction day $cp^{(s)}_\tau$).

Task: Out of all n stocks, predict k stocks with highest and k stocks with lowest intraday return $ir_{\tau, 0} = \dfrac{cp_\tau}{op_\tau} - 1$.
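To make the ranking step concrete, here is a small sketch that picks the `k` highest and `k` lowest entries of a score vector. For illustration it ranks on realised intraday returns of five made-up stocks; at prediction time $cp_\tau$ is not known, so a model's predicted scores would take the place of `ir_tau`. The helper `top_and_flop` is hypothetical, and it assumes $2k \le n$ so the two groups do not overlap.

```python
import numpy as np


def top_and_flop(scores, k):
    """Return indices of the k highest and the k lowest scores."""
    order = np.argsort(scores)   # ascending order of scores
    flop = order[:k]             # k lowest, worst first
    top = order[::-1][:k]        # k highest, best first
    return top, flop


# Made-up closing and opening prices of five stocks on day tau
cp_tau = np.array([101.0, 99.0, 105.0, 100.0, 97.0])
op_tau = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
ir_tau = cp_tau / op_tau - 1     # intraday return ir_{tau, 0}

top, flop = top_and_flop(ir_tau, k=2)
print(top, flop)
```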

**NOTE:** In the original paper, the authors used all the stocks that could be scraped from the web and then divided the data into 26 datasets. Among these 26 datasets, some may contain all 492 originally scraped stocks, while others may contain only 251. That is why the paper writes $s \in S$: each dataset has a different number of stocks, which form a subset of all the originally scraped stocks.

In our case, however, we only use the stocks that have all entries filled from 1990-01-02 to 2018-12-31. So we have 251 stocks in every dataset.



For LaTeX in markdown, refer to this page: [here](https://ashki23.github.io/markdown-latex.html)

### Feature generation for Random Forest

For any stock $s \in S$ and any time $t \in \{ 241, 242, ..., T_{study} \}$, the feature set we provide to the random forest comprises the following three signals:

1. Intraday return: $ir^{(s)}_{t, m} := \dfrac{cp^{(s)}_{t-m}}{op^{(s)}_{t-m}} - 1$,


2. Returns with respect to last closing price: $cr^{(s)}_{t, m} := \dfrac{cp^{(s)}_{t-1}}{cp^{(s)}_{t-1-m}} - 1$,


3. Returns with respect to opening price: $or^{(s)}_{t, m} := \dfrac{op^{(s)}_{t}}{cp^{(s)}_{t-m}} - 1$,

where $m \in \{ 1, 2, 3, \dots, 20 \} \cup \{ 40, 60, 80, \dots, 240 \}$, which gives $31 \times 3 = 93$ features. With this choice of $m$, we consider the returns of each trading day in the first month, whereas for the subsequent 11 months we only consider the corresponding multi-period returns of each month.

![ir-cr-and-or-calculation.png](https://i.postimg.cc/T3sjRDQ1/ir-cr-and-or-calculation.png)
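The three definitions can be evaluated for all values of $m$ at once with NumPy fancy indexing. The sketch below builds the 93-dimensional feature vector for a single stock and a single day; `rf_feature_vector` and the synthetic price series are assumptions made for the example, with `t` treated as a 0-based day index.

```python
import numpy as np

# m = {1, ..., 20} U {40, 60, ..., 240}: 20 + 11 = 31 lags
M = np.concatenate((np.arange(1, 21), np.arange(40, 241, 20)))
n_features = 3 * M.size


def rf_feature_vector(cp, op, t):
    """Stack ir_{t,m}, cr_{t,m} and or_{t,m} over all m for one stock and one day t."""
    if t < 241:
        raise ValueError("t must be at least 241 so that t - 1 - m >= 0 for m = 240")
    ir = cp[t - M] / op[t - M] - 1       # intraday returns
    cr = cp[t - 1] / cp[t - 1 - M] - 1   # returns w.r.t. last closing price
    orr = op[t] / cp[t - M] - 1          # returns w.r.t. opening price
    return np.concatenate((ir, cr, orr))


# Synthetic, strictly positive prices for one stock over 300 days
cp = np.linspace(100.0, 130.0, 300)
op = cp * 0.99
vec = rf_feature_vector(cp, op, t=250)
print(M.size, n_features, vec.shape)
```

The guard on `t` matters: with `t = 240` and `m = 240`, the index `t - 1 - m` would be `-1`, which NumPy silently interprets as the last element of the array.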

In [35]:
def calc_ir(cp: np.ndarray, op: np.ndarray, t: int, m: int) -> np.float64:
    '''
    Calculates Intraday return of t,m for one t and one m
    
    args:
        cp -> 1D array of closing prices for stock s
        op -> 1D array of opening prices for stock s
        t -> time at which we want the intraday return
        m -> how many days before the t-th day
        
    returns:
        Intraday return of day t-m
    '''
    
    return cp[t-m] / op[t-m] - 1

def calc_cr(cp: np.ndarray, t: int, m: int) -> np.float64:
    '''
    Calculates return with respect to last closing price for t, m
    
    args:
        cp -> 1D array of closing prices for stock s
        t -> time at which we want the return
        m -> how many days before the t-th day
        
    returns:
        Return with respect to last closing price on t-th day
    '''
    
    return cp[t-1] / cp[t-1-m] - 1

def calc_or(cp: np.ndarray, op: np.ndarray, t: int, m: int) -> np.float64:
    '''
    Calculates return with respect to opening price for t, m
    
    args:
        cp -> 1D array of closing prices for stock s
        op -> 1D array of opening prices for stock s
        t -> time at which we want the return
        m -> how many days before the t-th day
        
    returns:
        Return with respect to opening price on t-th day
    '''
    
    return op[t] / cp[t-m] - 1

In [20]:
# This function will generate new features for Random Forest
def generate_features_rf(curr_dataset):

    # Take the total number of days in the current study period
    T_study = curr_dataset.shape[1]
    print("current dataset has", T_study, " days.")

    # 0-based day indices t = [241, 242, ..., T_study - 1].
    # Starting at 241 keeps t-1-m >= 0 for m = 240 (a negative index would wrap around).
    t = np.arange(241, T_study)

    # m = [1, 2, 3, ..., 20] U [40, 60, 80, ..., 240]
    M = np.concatenate((np.arange(1, 21), np.arange(40, 241, 20)))

    # Number of stocks, used to create arrays with proper shapes
    n_stocks = curr_dataset.shape[0]

    # Create a container to store ir, cr and or for the current dataset
    container = np.ones(shape=(n_stocks, T_study, M.shape[0]*3))

    # Put NaN values in the first rows, as they are only used for feature creation
    container[:, :t[0], :] = np.nan

    # To calculate ir, we need cp_(t-m) and op_(t-m)
    cp_t_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))
    op_t_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))

    # To calculate cr, we need cp_(t-1) and cp_(t-1-m).
    # The trailing axis lets cp_(t-1) broadcast against the m axis.
    cp_t_1_m = np.zeros((n_stocks, t.shape[0], M.shape[0]))
    cp_t_1 = curr_dataset[:, t-1, 0][:, :, np.newaxis]

    # To calculate or, we need op_t and cp_(t-m)
    op_t = curr_dataset[:, t, 1][:, :, np.newaxis]

    # Fill cp_(t-m), op_(t-m) and cp_(t-1-m) for each m at index i of axis 2
    for i, m in enumerate(M):
        cp_t_m[:, :, i] = curr_dataset[:, t-m, 0]
        op_t_m[:, :, i] = curr_dataset[:, t-m, 1]
        cp_t_1_m[:, :, i] = curr_dataset[:, t-1-m, 0]

    # Calculate ir_(t,m); entries with a zero denominator are set to 0 - 1 = -1
    ir_t_m = np.divide(cp_t_m, op_t_m, out=np.zeros_like(cp_t_m), where=op_t_m!=0) - 1

    # Calculate cr_(t,m) by broadcasting instead of reshaping (reshape would scramble the axes)
    cr_t_m = np.divide(cp_t_1, cp_t_1_m, out=np.zeros_like(cp_t_1_m), where=cp_t_1_m!=0) - 1

    # Calculate or_(t,m)
    or_t_m = np.divide(op_t, cp_t_m, out=np.zeros_like(cp_t_m), where=cp_t_m!=0) - 1

    # Put the ir, cr and or inside the container
    container[:, t, :] = np.dstack((ir_t_m, cr_t_m, or_t_m))

    return container

In [21]:
# It will contain all the newly processed datasets each with a shape (251, stock days in 4 years, 93)
containers = []

# Run the generate_features_rf function for each dataset inside main_datasets
for dataset in main_datasets:
    containers.append(generate_features_rf(dataset))

current dataset has 1013  days.
current dataset has 1012  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1011  days.
current dataset has 1009  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1004  days.
current dataset has 1008  days.
current dataset has 1007  days.
current dataset has 1006  days.
current dataset has 1007  days.
current dataset has 1007  days.
current dataset has 1008  days.
current dataset has 1009  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1006  days.
current dataset has 1008  days.
current dataset has 1007  days.
current dataset has 1005  days.


In [22]:
len(containers)

26

### Feature generation for LSTM

We feed the model sequences of 240 timesteps with 3 features each and train it to predict the direction of the $241^{st}$ intraday return.

More precisely, for each stock $s$ at time $t$, we first consider the following three features $ir^{(s)}_{t, 1}, cr^{(s)}_{t, 1}, or^{(s)}_{t, 1}$ defined above.

Then we apply the Robust Scaler Standardization

$\tilde f^{(s)}_{t, 1} := \dfrac {f^{(s)}_{t,1} - Q_2(f^{(s)}_{.,1})} {Q_3(f^{(s)}_{.,1}) - Q_1(f^{(s)}_{.,1})}$

where $Q_1(f^{(s)}_{.,1}), Q_2(f^{(s)}_{.,1})$ and $Q_3(f^{(s)}_{.,1})$ are the first, second and third quartile of $f^{(s)}_{.,1}$, for each feature $f^{(s)}_{.,1} \in \{ ir^{(s)}_{., 1}, cr^{(s)}_{., 1}, or^{(s)}_{., 1} \}$ in the respective training period.

The Robust Scaler Standardization first subtracts (and hence removes) the median and then scales the data using the inter-quartile range, making it robust to outliers.
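The formula can be sketched in two steps: estimate the quartiles on the training part only, then apply them to the whole study period. The helper names, the synthetic data and the use of NumPy's default (linear) percentile interpolation are assumptions made for this example.

```python
import numpy as np


def fit_robust_scaler(train_values):
    """Estimate the median (Q2) and inter-quartile range (Q3 - Q1) on training data."""
    q1, q2, q3 = np.percentile(train_values, [25, 50, 75])
    return q2, q3 - q1


def apply_robust_scaler(values, median, iqr):
    """Standardize values with quartiles estimated beforehand."""
    return (values - median) / iqr


# Synthetic feature for one stock: 1008 days, with one extreme outlier
rng = np.random.default_rng(0)
feature = rng.normal(0.0, 0.01, size=1008)
feature[10] = 5.0

train = feature[:756]                                  # training part of the study period
median, iqr = fit_robust_scaler(train)
scaled = apply_robust_scaler(feature, median, iqr)     # applied to training and trading part

scaled_q1, scaled_q3 = np.percentile(scaled[:756], [25, 75])
print(np.median(scaled[:756]), scaled_q3 - scaled_q1)
```

On the training part the scaled feature has median 0 and inter-quartile range 1, and the single outlier barely affects either statistic, which would not hold for the mean and standard deviation.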

Next, for each time $t \in \{ 240, 241, ..., T_{study} \}$, we generate overlapping sequences of 240 consecutive, three-dimensional standardized features $\{ \tilde F^{(s)}_{t-239,1}, \tilde F^{(s)}_{t-238,1}, ..., \tilde F^{(s)}_{t,1} \}$, where $\tilde F^{(s)}_{t-i,1} := (\tilde ir^{(s)}_{t-i,1}, \tilde cr^{(s)}_{t-i,1}, \tilde or^{(s)}_{t-i,1}), i \in \{ 239, 238, ..., 0 \}$.
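One way to build these overlapping sequences for a single stock is NumPy's `sliding_window_view`. This is only a sketch under the assumption that the standardized features are already available as an array of shape `(n_days, 3)`; `make_sequences` is a hypothetical helper, not the notebook's final implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_sequences(features, window=240):
    """Turn (n_days, 3) features into (n_days - window + 1, window, 3) sequences."""
    # sliding_window_view puts the window axis last: (n_windows, 3, window)
    windows = sliding_window_view(features, window_shape=window, axis=0)
    # Reorder to (n_windows, timesteps, features), the usual LSTM input layout
    return windows.transpose(0, 2, 1)


# Toy input: 250 days, 3 features per day
features = np.arange(250 * 3, dtype=float).reshape(250, 3)
sequences = make_sequences(features)
print(sequences.shape)
```

Sequence `j` covers days `j` to `j + 239`, so consecutive sequences share 239 days. The result is a read-only view on `features`; call `.copy()` before modifying it.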

In [None]:
def generate_features_lstm(curr_dataset):
    '''
    This function generates features for LSTM.
    '''
    
    m = 1
    t = np.arange(1, 21)
    t = np.concatenate((t, np.arange(40, 241, 20)))