
Stock-HIGH-LOW-Prediction

To predict a stock's day HIGH and LOW prices based on its OPEN price and historical daily prices, using an LSTM model.

Problem

To predict a stock's day HIGH and LOW prices based on its OPEN price and historical daily prices.

Data Description

The data was downloaded from Yahoo Finance. The underlying stock is Tencent (700.HK). I renamed the columns as follows:

DATE: Trading date

OPEN: Current day open price

CLOSE: Current day last traded price recorded on the exchange

ADJ CLOSE: Current day close price adjusted for corporate actions (column present in the data)

HIGH: Current day highest trade price

LOW: Current day lowest trade price

VOLUME: Current day trade volume

Data File Path

The data file has been uploaded to https://mryinglee.github.io/DailyPrices.csv

DATA_PATH='https://mryinglee.github.io/DailyPrices.csv'

Academic Review

Time-varying volatility and volatility clustering are often observed in financial markets. High and low prices can be treated as indicators of volatility, and several papers have studied the prediction of high and low prices.

High and Low Prices Research Review

(Mainly digested from: Yaya, OlaOluwa S. and Gil-Alana, Luis A., "High and Low Intraday Commodity Prices: A Fractional Integration and Cointegration Approach", University of Navarra, Spain and University of Ibadan, Nigeria, 5 December 2018)

In financial economics, the difference between high and low intraday or daily prices is known as the range. Volatility can be expected to be higher if the range is wider. Parkinson (1980) showed that, in fact, the price range is a more efficient volatility estimator than alternatives such as the return-based estimator. It is also frequently used in technical analysis by traders in financial markets (see, e.g., Taylor and Allen, 1992). However, as pointed out by Cheung et al. (2009), focusing on the range itself might be useful if one’s only purpose is to obtain an efficient proxy for the underlying volatility, but it also means discarding useful information about price behaviour that can be found in its components. Therefore, in their study, Cheung et al. (2009) analyse simultaneously both the range and daily highs and lows using daily data for various stock market indices. Because of the observation that the latter two variables generally do not diverge significantly over time, having found that they both exhibit unit roots by carrying out Augmented Dickey-Fuller (ADF) tests (Dickey and Fuller, 1979), they model their behaviour using a cointegration framework as in Johansen (1991) and Johansen and Juselius (1990) to investigate whether they are linked through a long-run equilibrium relationship, and interpreting the range as a stationary error correction term. They then show that such a model has better in-sample properties than rival ARMA specifications but does not clearly outperform them in terms of its out-of-sample properties.

Following on from Cheung et al. (2009), the present study makes a twofold contribution to the literature. First, it uses fractional integration and cointegration methods that are more general than the standard framework based on the I(0) versus I(1) dichotomy. According to the efficient market hypothesis (EMH), asset prices should be unpredictable and follow a random walk (see Fama, 1970), i.e. they should be integrated of order 1 or I(1). However, the choice between stationary I(0) and nonstationary I(1) processes is too restrictive for most financial series (Barunik and Dvorakova, 2015). Diebold and Rudebusch (1991) and Hassler and Wolters (1994) showed that in fact unit root tests have very low power in the context of fractional integration. Therefore our analysis below allows the differencing parameter for the individual series to take fractional values. Moreover, we adopt a fractional cointegration approach to test for the long-run relationships. Fiess and MacDonald (2002), Cheung (2007) and Cheung et al. (2009) all modelled high and low prices together with the range in a cointegration framework to analyse the foreign exchange and stock markets, respectively. However, their studies restrict the cointegrating parameter to unity (even though this is not imposed in Granger’s (1986) seminal paper). Fractional cointegration models (see also Robinson and Yajima, 2002; Nielsen and Shimotsu, 2007; etc.) are more general and have already been shown to be more suitable for many financial series (see, e.g., Caporale and Gil-Alana, 2014 and Erer et al., 2016). The FCVAR model in particular has a number of advantages over the fractional cointegration setup of Robinson and Marinucci (2003): it allows for multiple time series and long-run equilibrium relationships to be determined using the statistical test of MacKinnon and Nielsen (2014), and it jointly estimates the adjustment coefficients and the cointegrating relations. Nielsen and Popiel (2018) provide a Matlab package for the calculation of the estimators and test statistics. Dolatabadi et al. (2016) applied the FCVAR model to analyse the relationship between spot and futures prices in commodity futures markets and found more support for cointegration compared to the case when the cointegration parameters are restricted to unity.

On the predictability of stock prices: a case for high and low prices

Caporin, M., Ranaldo, A. and Santucci de Magistris, P. (2013), "On the predictability of stock prices: a case for high and low prices", Journal of Banking and Finance, 37(12), 5132–5146, argued:

Contrary to the common wisdom that asset prices are hardly possible to forecast, we show that high and low prices of equity shares are largely predictable. We propose to model them using a simple implementation of a fractional vector autoregressive model with error correction (FVECM). This model captures two fundamental patterns of high and low prices: their cointegrating relationship and the long memory of their difference (i.e. the range), which is a measure of realized volatility. Investment strategies based on FVECM predictions of high/low US equity prices as exit/entry signals deliver a superior performance even on a risk-adjusted basis.

Some FCVAR-related implementations

LeeMorinUCF/FCVAR: Fractionally cointegrated VAR model, implemented in Matlab.

Tinkat/FCVAR: Fractionally cointegrated vector autoregressive model, implemented in R.

Why Machine Learning?

Machine learning offers modern tools for processing stock data.

Why LSTM model?

LSTM is a good model for processes with memory effects. I have a summary of LSTM models for stock prices on GitHub (Stock-Price-Specific-LSTM).

Setup Environment

We use Keras + TensorFlow along with other Python ecosystem packages such as NumPy, pandas, and matplotlib.

!pip install tensorflow-gpu
!pip install mpl_finance
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from tensorflow.keras.layers import Bidirectional, Dropout, Activation, Dense, LSTM
from tensorflow.python.keras.layers import CuDNNLSTM # GPU-only LSTM implementation
from tensorflow.keras.models import Sequential

%matplotlib inline

rcParams['figure.figsize'] = 28, 16

Load Data and Overview Data

df = pd.read_csv(DATA_PATH, parse_dates=['DATE'])
df.set_index('DATE', inplace=True)
df.sort_index(inplace=True)  # sort chronologically by the DATE index
df
             OPEN   HIGH    LOW  CLOSE  ADJ CLOSE    VOLUME
DATE                                                       
2015-01-21  126.2  128.9  125.2  128.7      127.1  33788560
2015-01-22  130.5  132.0  129.7  131.6      130.0  39063807
2015-01-23  134.6  134.9  131.1  132.7      131.1  29965533
2015-01-26  136.7  137.1  134.3  137.0      135.3  34952624
2015-01-27  138.0  138.0  133.0  136.0      134.3  24455759
...           ...    ...    ...    ...        ...       ...
2020-01-14  410.0  413.0  396.6  400.4      400.4  26827634
2020-01-15  397.2  403.0  396.2  398.8      398.8  15938138
2020-01-16  399.0  403.0  396.4  400.0      400.0  13770626
2020-01-17  400.0  400.6  396.0  399.0      399.0  13670846
2020-01-20  405.0  405.0  396.0  396.0      396.0  13282412

[1231 rows x 6 columns]
df.shape
(1231, 6)
df.dropna(inplace=True)
df.shape
(1231, 6)
import matplotlib.dates as mdates
from mpl_finance import candlestick_ohlc
df_candle = df[['OPEN', 'HIGH', 'LOW', 'CLOSE']].copy()  # .copy() avoids a SettingWithCopyWarning on later edits

df_candle.reset_index(inplace=True)
df_candle['DATE'] = df_candle['DATE'].map(mdates.date2num)
# df_candle.drop(columns='DATE',inplace=True)
df_candle=df_candle.astype(float)
df_candle=df_candle[['DATE','OPEN', 'HIGH', 'LOW', 'CLOSE']]
df_candle.head()

# Define a new subplot
ax = plt.subplot()

# Plot the candlestick chart and put date on x-Axis
candlestick_ohlc(ax, df_candle.values, width=5, colorup='g', colordown='r')
ax.xaxis_date()

# Turn on grid
ax.grid(True)

# Show plot
plt.show()

(Figure: candlestick chart of Tencent daily OHLC prices)

Obviously, there are gaps in the candlestick chart. In other words, on gap days the current day's price range [LOW, HIGH] does not overlap that of the previous day.

The OPEN price, rather than the previous day's prices, is a better indicator for HIGH/LOW

  1. In theory, market prices respond to market news, most of which is released between trading sessions. News can trigger price jumps up or down, so the previous day's price range is not a good leading indicator of the current day's range. The OPEN price, by contrast, has already priced in the overnight news.
  2. Even in the case of dividends, the OPEN price as an indicator handles the adjustment better than the previous day's prices.
  3. The OPEN is always within the range [LOW, HIGH].
  4. It's practical to predict HIGH/LOW after the market opens, especially for a day trader.
  5. Technically, if we use OPEN as a benchmark, OPEN/HIGH and LOW/OPEN both lie in (0,1], which is convenient for deep learning to optimize.

If we don't use OPEN as an indicator, then in feature engineering we would need other capping tricks to keep the relative HIGH and LOW within (0,1], as the sketch below illustrates.
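To make this concrete, here is a minimal sketch using the first data row shown above; the 0.9 floor in the last line is an arbitrary assumed value, included only to illustrate the kind of capping that would otherwise be needed.

import numpy as np

# 2015-01-22: OPEN 130.5, HIGH 132.0, LOW 129.7; previous CLOSE 128.7
open_px, high, low, prev_close = 130.5, 132.0, 129.7, 128.7
print(open_px / high)      # 0.9886... -> always in (0, 1], since LOW <= OPEN <= HIGH
print(low / open_px)       # 0.9939... -> always in (0, 1] for the same reason
print(prev_close / high)   # no such guarantee: gaps can push this ratio above 1 (gap down) or far below it (gap up)
# Without OPEN as the benchmark, a cap would be needed, e.g.:
capped = np.clip(prev_close / high, 0.9, 1.0)  # 0.9 is an arbitrary assumed floor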

To Check the Tick Movement Frequency Distribution

We need to check the tick distribution to standardize the tick data

TICK_UNIT=0.2
df['HIGH_tick']=(np.round((df['HIGH']-df['OPEN'])/TICK_UNIT)).astype('int32')
df['LOW_tick']=(np.round((df['OPEN']-df['LOW'])/TICK_UNIT)).astype('int32')

df['RANGE_TICKS']=df['HIGH_tick']+df['LOW_tick'] ## ticks between day low and day high
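As a quick sanity check of the tick conversion, here is the arithmetic for the 2015-01-26 row (OPEN 136.7, HIGH 137.1, LOW 134.3), which matches the HIGH_tick/LOW_tick values shown later in the feature table:

high_tick = int(np.round((137.1 - 136.7) / TICK_UNIT))  # (0.4 / 0.2) -> 2 ticks above OPEN
low_tick  = int(np.round((136.7 - 134.3) / TICK_UNIT))  # (2.4 / 0.2) -> 12 ticks below OPEN
range_ticks = high_tick + low_tick                      # 14 ticks between day LOW and day HIGH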
df.max()
OPEN                 474.0
HIGH                 476.6
LOW                  466.8
CLOSE                474.6
ADJ CLOSE            472.3
VOLUME         308436765.0
HIGH_tick            102.0
LOW_tick              84.0
RANGE_TICKS          108.0
dtype: float64
df.mean()
OPEN           2.691491e+02
HIGH           2.717566e+02
LOW            2.661037e+02
CLOSE          2.688868e+02
ADJ CLOSE      2.676347e+02
VOLUME         2.027254e+07
HIGH_tick      1.303574e+01
LOW_tick       1.522177e+01
RANGE_TICKS    2.825751e+01
dtype: float64
df[["HIGH_tick","LOW_tick"]].groupby("HIGH_tick").count().head(40)
HIGH_tick  count
0 183
1 48
2 64
3 55
4 63
5 77
6 28
7 42
8 48
9 43
10 40
11 33
12 32
13 33
14 40
15 36
16 17
17 24
18 22
19 28
20 19
21 16
22 12
23 11
24 17
25 12
26 12
27 10
28 11
29 11
30 6
31 1
32 15
33 8
34 10
35 9
36 4
37 7
38 6
39 4
df[["HIGH_tick","LOW_tick"]].groupby("LOW_tick").count().head(40)
LOW_tick  count
0 108
1 37
2 50
3 42
4 47
5 58
6 39
7 60
8 53
9 47
10 57
11 37
12 45
13 43
14 21
15 42
16 19
17 29
18 21
19 25
20 32
21 20
22 20
23 19
24 11
25 20
26 15
27 13
28 8
29 10
30 16
31 11
32 8
33 11
34 12
35 3
36 4
37 11
38 10
39 7
df_low_freq=pd.DataFrame({'count' : df[["HIGH_tick","LOW_tick"]].groupby("LOW_tick").size()}).reset_index()
plt.bar(df_low_freq["LOW_tick"], df_low_freq["count"])
<BarContainer object of 72 artists>

(Figure: bar chart of the LOW_tick frequency distribution)

df_high_freq=pd.DataFrame({'count' : df[["HIGH_tick","LOW_tick"]].groupby("HIGH_tick").size()}).reset_index()
plt.bar(df_high_freq["HIGH_tick"], df_high_freq["count"])
<BarContainer object of 73 artists>

(Figure: bar chart of the HIGH_tick frequency distribution)

df_range_freq=pd.DataFrame({'count' : df[["RANGE_TICKS","LOW_tick"]].groupby("RANGE_TICKS").size()}).reset_index()
plt.bar(df_range_freq["RANGE_TICKS"], df_range_freq["count"])
<BarContainer object of 89 artists>

(Figure: bar chart of the RANGE_TICKS frequency distribution)

MAX_RANGE_TICKS=100 # We use this to standardize the RANGE_TICKS

Findings on Tick Movement Frequency Distribution

Relative to the OPEN price, the up/down tick movements range from 0 to roughly 100 ticks.

By ignoring the outliers we may miss some profitable opportunities, but the model becomes more stable.

Overview Conclusion

There are no date conflicts, such as duplicate dates, and there are no N/A values. But there are many price gaps, so the next day's HIGH and LOW can both fall outside the current day's range.

The OPEN price, rather than the previous day's prices, is a better indicator for HIGH/LOW.

Feature Engineering

df['CLOSE-1']=df['CLOSE'].shift(1)         # The LAST (close) price of the previous day
df['OPEN/LAST-1']=df['OPEN']/df['CLOSE-1'] # The ratio of the current OPEN to the previous day's LAST price
df['OPEN+1']=df['OPEN/LAST-1'].shift(-1)   # The next day's OPEN/LAST-1 ratio
df['OPEN/HIGH']=df['OPEN']/df['HIGH']      # The ratio of OPEN to HIGH, which lies in (0,1]
df['LOW/OPEN']=df['LOW']/df['OPEN']        # The ratio of LOW to OPEN, which lies in (0,1]
df.dropna(inplace=True)
df.head()
OPEN HIGH LOW CLOSE ADJ CLOSE VOLUME HIGH_tick LOW_tick RANGE_TICKS CLOSE-1 OPEN/LAST-1 OPEN+1 OPEN/HIGH LOW/OPEN
DATE
2015-01-22 130.5 132.0 129.7 131.6 130.0 39063807 8 4 12 128.7 1.013986 1.022796 0.988636 0.993870
2015-01-23 134.6 134.9 131.1 132.7 131.1 29965533 2 18 20 131.6 1.022796 1.030143 0.997776 0.973997
2015-01-26 136.7 137.1 134.3 137.0 135.3 34952624 2 12 14 132.7 1.030143 1.007299 0.997082 0.982443
2015-01-27 138.0 138.0 133.0 136.0 134.3 24455759 0 25 25 137.0 1.007299 1.006618 1.000000 0.963768
2015-01-28 136.9 137.0 134.7 136.9 135.2 16216906 0 11 11 136.0 1.006618 0.993426 0.999270 0.983930

Outlier Processing with a Customized Scaler

To standardize the tick distribution

We will use a customized scale function to map the tick data into (0,1]. All outliers will be assigned 1.

def capped_with_1(tick_value):
    return min(1, tick_value)

df["HIGH_tick_std"] = df.apply(lambda x: capped_with_1(x["HIGH_tick"]/MAX_RANGE_TICKS), axis=1)
df["LOW_tick_std"] = df.apply(lambda x: capped_with_1(x["LOW_tick"]/MAX_RANGE_TICKS), axis=1)
df["RANGE_TICKS_std"] = df.apply(lambda x: capped_with_1(x["RANGE_TICKS"]/MAX_RANGE_TICKS), axis=1)
df.dropna(inplace=True) # Drop any remaining N/A rows left by the shift operations
df.head()
OPEN HIGH LOW CLOSE ADJ CLOSE VOLUME HIGH_tick LOW_tick RANGE_TICKS CLOSE-1 OPEN/LAST-1 OPEN+1 OPEN/HIGH LOW/OPEN HIGH_tick_std LOW_tick_std RANGE_TICKS_std
DATE
2015-01-22 130.5 132.0 129.7 131.6 130.0 39063807 8 4 12 128.7 1.013986 1.022796 0.988636 0.993870 0.08 0.04 0.12
2015-01-23 134.6 134.9 131.1 132.7 131.1 29965533 2 18 20 131.6 1.022796 1.030143 0.997776 0.973997 0.02 0.18 0.20
2015-01-26 136.7 137.1 134.3 137.0 135.3 34952624 2 12 14 132.7 1.030143 1.007299 0.997082 0.982443 0.02 0.12 0.14
2015-01-27 138.0 138.0 133.0 136.0 134.3 24455759 0 25 25 137.0 1.007299 1.006618 1.000000 0.963768 0.00 0.25 0.25
2015-01-28 136.9 137.0 134.7 136.9 135.2 16216906 0 11 11 136.0 1.006618 0.993426 0.999270 0.983930 0.00 0.11 0.11
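As an aside, the same scaling can be written in vectorized form with np.clip, avoiding the row-by-row apply. This is just an equivalent sketch, not what the notebook above runs:

# Equivalent vectorized scaling: divide by MAX_RANGE_TICKS, then cap at 1
df["HIGH_tick_std"] = np.clip(df["HIGH_tick"] / MAX_RANGE_TICKS, None, 1)
df["LOW_tick_std"] = np.clip(df["LOW_tick"] / MAX_RANGE_TICKS, None, 1)
df["RANGE_TICKS_std"] = np.clip(df["RANGE_TICKS"] / MAX_RANGE_TICKS, None, 1)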

Model 1: Return-based LSTM model

If we can predict the relative HIGH and LOW based on the OPEN price, we can recover the corresponding real HIGH and LOW.

Range Prediction

The main difficulty for this model is that the task is to predict one HIGH and one LOW value, which together define a range. Without any constraint on the prediction, the predicted LOW could end up higher than the predicted HIGH. So we have to predict a reasonable range.

The trick is to use OPEN/HIGH and LOW/OPEN as both the indicators and the values to be predicted. These two generated columns lie in (0,1], and every combination of OPEN/HIGH and LOW/OPEN implies a reasonable range: since HIGH = OPEN / (OPEN/HIGH) ≥ OPEN and LOW = OPEN × (LOW/OPEN) ≤ OPEN, the implied HIGH is always greater than or equal to the implied LOW, as the check below illustrates.
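A small self-contained check (the 0.99/0.98 values are arbitrary stand-ins for model outputs):

open_px = 130.5
r_high, r_low = 0.99, 0.98             # hypothetical predictions, each in (0, 1]
high_hat = open_px / r_high            # 131.82... >= OPEN because r_high <= 1
low_hat = open_px * r_low              # 127.89   <= OPEN because r_low  <= 1
assert low_hat <= open_px <= high_hat  # the implied range is always valid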

The critical activation function for the output layer

Because the predicted values lie in (0,1] and have no direct relation to each other, a sigmoid activation function suits the output layer.

Sigmoid

Of course, there are a few other possible activation candidates, such as TanH, SQNL, Gaussian and SQ-RBF. But sigmoid has the nice property that its outputs can concentrate near 1, which fits our situation.

Details on activation functions can be found at:

Activation function
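For reference, a minimal definition of the sigmoid; note that it maps any real input into the open interval (0, 1):

def sigmoid(x):
    # Squashes any real x into (0, 1); outputs approach 1 for large positive x
    return 1.0 / (1.0 + np.exp(-x))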

Preprocess

history_values = df[["RANGE_TICKS_std", "OPEN/HIGH", "LOW/OPEN"]].values
history_values.shape
(1229, 3)
SEQ_LEN = 20 # about 1 month


def to_sequences(data, seq_len):
    # Build overlapping windows of length seq_len
    d = []

    for index in range(len(data) - seq_len):
        d.append(data[index: index + seq_len])

    return np.array(d)

def preprocess(data_raw, seq_len, train_split):

    data = to_sequences(data_raw, seq_len)

    num_train = int(train_split * data.shape[0])

    # X: the first seq_len-1 steps of each window, all 3 features
    X_train = data[:num_train, :-1, :]
    # y: the last step's final two columns (OPEN/HIGH, LOW/OPEN)
    y_train = data[:num_train, -1, -2:]

    X_test = data[num_train:, :-1, :]
    y_test = data[num_train:, -1, -2:]

    return X_train, y_train, X_test, y_test


X_train, y_train, X_test, y_test = preprocess(history_values, SEQ_LEN, train_split = 0.95)
X_train.shape
(1148, 19, 3)

Model Definition

This model was copied from a model for predicting the close price of a cryptocurrency. It was selected because its author argued that it has some good technical features.

A common LSTM should be fine too.

The difference is that here there are 2 outputs, with sigmoid as the activation function.

DROPOUT = 0.2 # Dropout rate, to mitigate overfitting
WINDOW_SIZE = (SEQ_LEN - 1)

model = keras.Sequential()

model.add(Bidirectional(CuDNNLSTM(WINDOW_SIZE, return_sequences=True),
                        input_shape=(WINDOW_SIZE, X_train.shape[-1])))
model.add(Dropout(rate=DROPOUT))

model.add(Bidirectional(CuDNNLSTM((WINDOW_SIZE * 2), return_sequences=True)))
model.add(Dropout(rate=DROPOUT))

model.add(Bidirectional(CuDNNLSTM(WINDOW_SIZE, return_sequences=False)))

model.add(Dense(units=2))

model.add(Activation('sigmoid')) # So each output lies in (0, 1)
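Note that CuDNNLSTM only runs on a GPU. If none is available, the standard tf.keras LSTM layer is an architectural drop-in replacement here (slower, but the same shapes); a sketch of the first layer:

# GPU-free variant of the first recurrent layer (the others are analogous):
from tensorflow.keras.layers import LSTM
model_cpu = keras.Sequential()
model_cpu.add(Bidirectional(LSTM(WINDOW_SIZE, return_sequences=True),
                            input_shape=(WINDOW_SIZE, X_train.shape[-1])))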

Training

model.compile(
    loss='mean_squared_error', 
    optimizer=keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-09, # Much smaller than the default: the predicted values are always near 1 and the loss very small, so distinguishing a one-tick difference needs a very small epsilon.
    amsgrad=False,
    name='Adam')
)
BATCH_SIZE = 64

history = model.fit(
    X_train, 
    y_train, 
    epochs=100, 
    batch_size=BATCH_SIZE, 
    shuffle=False,
    validation_split=0.1
)
Train on 1033 samples, validate on 115 samples

Training Log

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

(Figure: training vs. validation loss across all epochs)

plt.plot(history.history['loss'][-50:])
plt.plot(history.history['val_loss'][-50:])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

(Figure: training vs. validation loss over the last 50 epochs)

model.evaluate(X_test, y_test)
4.558617681633208e-05

Model Predict

X_test.shape[0]
61
y_hat = model.predict(X_test)
df_test= df[-X_test.shape[0]:].reset_index()
df_test
DATE OPEN HIGH LOW CLOSE ADJ CLOSE VOLUME HIGH_tick LOW_tick RANGE_TICKS CLOSE-1 OPEN/LAST-1 OPEN+1 OPEN/HIGH LOW/OPEN HIGH_tick_std LOW_tick_std RANGE_TICKS_std
0 2019-10-21 329.6 330.2 324.8 324.8 324.8 13947162 3 24 27 331.0 0.995770 1.000616 0.998183 0.985437 0.03 0.24 0.27
1 2019-10-22 325.0 327.8 324.8 327.6 327.6 10448427 14 1 15 324.8 1.000616 0.991453 0.991458 0.999385 0.14 0.01 0.15
2 2019-10-23 324.8 325.8 319.6 320.0 320.0 19855257 5 26 31 327.6 0.991453 0.996875 0.996931 0.983990 0.05 0.26 0.31
3 2019-10-24 319.0 320.6 316.6 319.0 319.0 18472498 8 12 20 320.0 0.996875 1.004389 0.995009 0.992476 0.08 0.12 0.20
4 2019-10-25 320.4 320.4 316.6 316.6 316.6 15789881 0 19 19 319.0 1.004389 0.999368 1.000000 0.988140 0.00 0.19 0.19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2020-01-13 400.0 406.4 397.4 406.4 406.4 27570261 32 13 45 398.6 1.003512 1.008858 0.984252 0.993500 0.32 0.13 0.45
57 2020-01-14 410.0 413.0 396.6 400.4 400.4 26827634 15 67 82 406.4 1.008858 0.992008 0.992736 0.967317 0.15 0.67 0.82
58 2020-01-15 397.2 403.0 396.2 398.8 398.8 15938138 29 5 34 400.4 0.992008 1.000502 0.985608 0.997482 0.29 0.05 0.34
59 2020-01-16 399.0 403.0 396.4 400.0 400.0 13770626 20 13 33 398.8 1.000502 1.000000 0.990074 0.993484 0.20 0.13 0.33
60 2020-01-17 400.0 400.6 396.0 399.0 399.0 13670846 3 20 23 400.0 1.000000 1.015038 0.998502 0.990000 0.03 0.20 0.23

61 rows × 18 columns


df_hat = pd.DataFrame(data =y_hat,columns=['OPEN/HIGH<hat>','LOW/OPEN<hat>'])
df_hat
OPEN/HIGH<hat> LOW/OPEN<hat>
0 0.990460 0.988908
1 0.990460 0.988910
2 0.990456 0.988902
3 0.990457 0.988905
4 0.990456 0.988903
... ... ...
56 0.990469 0.988909
57 0.990469 0.988913
58 0.990467 0.988908
59 0.990469 0.988905
60 0.990483 0.988921

61 rows × 2 columns

df_compare=pd.concat([df_test, df_hat], axis=1)

df_compare
DATE OPEN HIGH LOW CLOSE ADJ CLOSE VOLUME HIGH_tick LOW_tick RANGE_TICKS CLOSE-1 OPEN/LAST-1 OPEN+1 OPEN/HIGH LOW/OPEN HIGH_tick_std LOW_tick_std RANGE_TICKS_std OPEN/HIGH<hat> LOW/OPEN<hat>
0 2019-10-21 329.6 330.2 324.8 324.8 324.8 13947162 3 24 27 331.0 0.995770 1.000616 0.998183 0.985437 0.03 0.24 0.27 0.990460 0.988908
1 2019-10-22 325.0 327.8 324.8 327.6 327.6 10448427 14 1 15 324.8 1.000616 0.991453 0.991458 0.999385 0.14 0.01 0.15 0.990460 0.988910
2 2019-10-23 324.8 325.8 319.6 320.0 320.0 19855257 5 26 31 327.6 0.991453 0.996875 0.996931 0.983990 0.05 0.26 0.31 0.990456 0.988902
3 2019-10-24 319.0 320.6 316.6 319.0 319.0 18472498 8 12 20 320.0 0.996875 1.004389 0.995009 0.992476 0.08 0.12 0.20 0.990457 0.988905
4 2019-10-25 320.4 320.4 316.6 316.6 316.6 15789881 0 19 19 319.0 1.004389 0.999368 1.000000 0.988140 0.00 0.19 0.19 0.990456 0.988903
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2020-01-13 400.0 406.4 397.4 406.4 406.4 27570261 32 13 45 398.6 1.003512 1.008858 0.984252 0.993500 0.32 0.13 0.45 0.990469 0.988909
57 2020-01-14 410.0 413.0 396.6 400.4 400.4 26827634 15 67 82 406.4 1.008858 0.992008 0.992736 0.967317 0.15 0.67 0.82 0.990469 0.988913
58 2020-01-15 397.2 403.0 396.2 398.8 398.8 15938138 29 5 34 400.4 0.992008 1.000502 0.985608 0.997482 0.29 0.05 0.34 0.990467 0.988908
59 2020-01-16 399.0 403.0 396.4 400.0 400.0 13770626 20 13 33 398.8 1.000502 1.000000 0.990074 0.993484 0.20 0.13 0.33 0.990469 0.988905
60 2020-01-17 400.0 400.6 396.0 399.0 399.0 13670846 3 20 23 400.0 1.000000 1.015038 0.998502 0.990000 0.03 0.20 0.23 0.990483 0.988921

61 rows × 20 columns

From Prediction to Ticks (Practical Price)

The predictions are actually OPEN-based HIGH and LOW returns. We need to convert them back to real prices in units of TICK.

df_compare['HIGH_hat']=np.floor((df_compare['OPEN']/df_compare['OPEN/HIGH<hat>'])/TICK_UNIT)*TICK_UNIT
df_compare['LOW_hat']=np.ceil((df_compare['OPEN']*df_compare['LOW/OPEN<hat>'])/TICK_UNIT)*TICK_UNIT
df_compare['HIGH_diff']=df_compare['HIGH_hat']-df_compare['HIGH']
df_compare['LOW_diff']=df_compare['LOW_hat']-df_compare['LOW']
df_compare['HIGH_diff_tick']=(np.round(df_compare['HIGH_diff']/TICK_UNIT)).astype('int32')
df_compare['LOW_diff_tick']=(np.round(df_compare['LOW_diff']/TICK_UNIT)).astype('int32')
df_compare['RANGE_TICKS_hat']=(np.round((df_compare['HIGH_hat']-df_compare['LOW_hat'])/TICK_UNIT)).astype('int32')
df_compare[['DATE', 'OPEN', 'CLOSE', 'HIGH', 'HIGH_hat','HIGH_diff_tick', 'LOW','LOW_hat','LOW_diff_tick','RANGE_TICKS','RANGE_TICKS_hat']]
DATE OPEN CLOSE HIGH HIGH_hat HIGH_diff_tick LOW LOW_hat LOW_diff_tick RANGE_TICKS RANGE_TICKS_hat
0 2019-10-21 329.6 324.8 330.2 332.6 12 324.8 326.0 6 27 33
1 2019-10-22 325.0 327.6 327.8 328.0 1 324.8 321.4 -17 15 33
2 2019-10-23 324.8 320.0 325.8 327.8 10 319.6 321.2 8 31 33
3 2019-10-24 319.0 319.0 320.6 322.0 7 316.6 315.6 -5 20 32
4 2019-10-25 320.4 316.6 320.4 323.4 15 316.6 317.0 2 19 32
... ... ... ... ... ... ... ... ... ... ... ...
56 2020-01-13 400.0 406.4 406.4 403.8 -13 397.4 395.6 -9 45 41
57 2020-01-14 410.0 400.4 413.0 413.8 4 396.6 405.6 45 82 41
58 2020-01-15 397.2 398.8 403.0 401.0 -10 396.2 392.8 -17 34 41
59 2020-01-16 399.0 400.0 403.0 402.8 -1 396.4 394.6 -9 33 41
60 2020-01-17 400.0 399.0 400.6 403.8 16 396.0 395.6 -2 23 41

61 rows × 11 columns
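As a worked check of the first row above (2019-10-21: OPEN 329.6, predicted OPEN/HIGH ≈ 0.990460 and LOW/OPEN ≈ 0.988908):

open_px = 329.6
high_hat = np.floor((open_px / 0.990460) / TICK_UNIT) * TICK_UNIT  # -> 332.6
low_hat  = np.ceil((open_px * 0.988908) / TICK_UNIT) * TICK_UNIT   # -> 326.0

Flooring the HIGH estimate and ceiling the LOW estimate keeps both on the conservative (inside) side of the predicted range once snapped to the tick grid.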

df_compare
DATE OPEN HIGH LOW CLOSE ADJ CLOSE VOLUME HIGH_tick LOW_tick RANGE_TICKS CLOSE-1 OPEN/LAST-1 OPEN+1 OPEN/HIGH LOW/OPEN HIGH_tick_std LOW_tick_std RANGE_TICKS_std OPEN/HIGH<hat> LOW/OPEN<hat> HIGH_hat LOW_hat HIGH_diff LOW_diff HIGH_diff_tick LOW_diff_tick RANGE_TICKS_hat
0 2019-10-21 329.6 330.2 324.8 324.8 324.8 13947162 3 24 27 331.0 0.995770 1.000616 0.998183 0.985437 0.03 0.24 0.27 0.990460 0.988908 332.6 326.0 2.4 1.2 12 6 33
1 2019-10-22 325.0 327.8 324.8 327.6 327.6 10448427 14 1 15 324.8 1.000616 0.991453 0.991458 0.999385 0.14 0.01 0.15 0.990460 0.988910 328.0 321.4 0.2 -3.4 1 -17 33
2 2019-10-23 324.8 325.8 319.6 320.0 320.0 19855257 5 26 31 327.6 0.991453 0.996875 0.996931 0.983990 0.05 0.26 0.31 0.990456 0.988902 327.8 321.2 2.0 1.6 10 8 33
3 2019-10-24 319.0 320.6 316.6 319.0 319.0 18472498 8 12 20 320.0 0.996875 1.004389 0.995009 0.992476 0.08 0.12 0.20 0.990457 0.988905 322.0 315.6 1.4 -1.0 7 -5 32
4 2019-10-25 320.4 320.4 316.6 316.6 316.6 15789881 0 19 19 319.0 1.004389 0.999368 1.000000 0.988140 0.00 0.19 0.19 0.990456 0.988903 323.4 317.0 3.0 0.4 15 2 32
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
56 2020-01-13 400.0 406.4 397.4 406.4 406.4 27570261 32 13 45 398.6 1.003512 1.008858 0.984252 0.993500 0.32 0.13 0.45 0.990469 0.988909 403.8 395.6 -2.6 -1.8 -13 -9 41
57 2020-01-14 410.0 413.0 396.6 400.4 400.4 26827634 15 67 82 406.4 1.008858 0.992008 0.992736 0.967317 0.15 0.67 0.82 0.990469 0.988913 413.8 405.6 0.8 9.0 4 45 41
58 2020-01-15 397.2 403.0 396.2 398.8 398.8 15938138 29 5 34 400.4 0.992008 1.000502 0.985608 0.997482 0.29 0.05 0.34 0.990467 0.988908 401.0 392.8 -2.0 -3.4 -10 -17 41
59 2020-01-16 399.0 403.0 396.4 400.0 400.0 13770626 20 13 33 398.8 1.000502 1.000000 0.990074 0.993484 0.20 0.13 0.33 0.990469 0.988905 402.8 394.6 -0.2 -1.8 -1 -9 41
60 2020-01-17 400.0 400.6 396.0 399.0 399.0 13670846 3 20 23 400.0 1.000000 1.015038 0.998502 0.990000 0.03 0.20 0.23 0.990483 0.988921 403.8 395.6 3.2 -0.4 16 -2 41

61 rows × 27 columns

The Results Are Not Good Although the Training Seems Good
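One way to see why: an MSE of about 4.6e-5 on ratios near 1 corresponds to a root-mean-square error of roughly 0.0068, i.e. about 0.68% of the OPEN price, which at an OPEN near 400 is about 2.7 in price, or roughly 13 ticks. A small loss therefore does not imply tick-level accuracy. A quick back-of-the-envelope check:

mse = 4.558617681633208e-05        # test loss from model.evaluate above
rmse_ratio = np.sqrt(mse)          # ~0.0068 in OPEN/HIGH or LOW/OPEN units
price_err = 400.0 * rmse_ratio     # ~2.7 in price at an OPEN near 400
tick_err = price_err / TICK_UNIT   # ~13.5 ticks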

How to improve

Data

To improve the effectiveness of prediction, we need to add more data:

  1. Instead of a single name, we need a big pool of names to train the model. A single name provides very limited data for training. Also, we want the model to predict not only this name but other names as well. It would also help to include market index data.

  2. Instead of the real HIGH and LOW values, we may need more practical targets. HIGH and LOW are extreme values: they are unstable and can be caused by a single price spike, and even if we predicted them correctly it might not be feasible to fill an order at those prices. So instead of predicting HIGH and LOW, we could predict a 95% confidence range around the VWAP, which could be more realistic (see the sketch after this list).

  3. Instead of a single daily volume, we could use an array of volume at price for the day. More volume means more evidence, so the extreme values (high/low) could be predicted with more confidence.

  4. More reliable data, such as the VWAP of the AM and PM sessions.
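As a sketch of the VWAP idea from item 2 (synthetic trades; the column names 'price' and 'volume' and the normality assumption behind the 1.96 multiplier are my own, not from the original project):

import numpy as np
import pandas as pd

# Synthetic intraday trades
trades = pd.DataFrame({
    "price":  [400.0, 400.4, 399.8, 401.0, 400.2],
    "volume": [1200, 800, 1500, 600, 900],
})

# Volume-weighted average price
vwap = (trades["price"] * trades["volume"]).sum() / trades["volume"].sum()

# Volume-weighted variance around VWAP; a rough 95% band under a normality assumption
var = ((trades["price"] - vwap) ** 2 * trades["volume"]).sum() / trades["volume"].sum()
band = 1.96 * np.sqrt(var)
low_95, high_95 = vwap - band, vwap + band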

Model

  1. We may try other, non-machine-learning models, such as the fractionally cointegrated vector autoregressive (FCVAR) model.

  2. We may try other machine learning models, or even an AutoML package.

  3. We may tune hyperparameters.

Conclusion

This repository was digested from a larger project I did. In that project, more feature engineering and additional models were used, and the results are good.

In this repository I haven't tuned the hyperparameters yet. This is just a demo, but it aims at practical usage.
