<header style=" text-align: center; color:blue; font-size:45px">Order Imbalance Based Strategy</header>

<h1 style="color:purple">Author</h1>

  
- Name:   LUO Yiling                   
- Student ID:   20881826


## Introduction

Many studies have examined the relationship between trading activity (volume) and price changes. Ask (sell) and bid (buy) orders in the order book may signal the direction of market movement. To model the possible relationship between order imbalance (between the ask and bid sides) and price changes, we build new predictors and check the performance of a multiple linear regression in paper trading (backtesting with historical data).



## Data

In total, we have 10 days of data.

In [65]:
import pandas as pd
import numpy as np
import os, math, pathlib
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
from os import listdir
from multiprocessing import Process
import warnings
warnings.filterwarnings('ignore')

First, we read the training and test data.

In [66]:
url_train = 'https://drive.google.com/file/d/1z7Kz_cNK0_E4SIGuZvve_ALjlYUYdJBL/view?usp=share_link'
url_train = 'https://drive.google.com/uc?export=download&id='+url_train.split('/')[-2]

In [67]:
url_test= 'https://drive.google.com/file/d/1z7HRZUI1H-2HUqwZbr-Mk63ecBxxgh1O/view?usp=share_link'
url_test = 'https://drive.google.com/uc?export=download&id='+url_test.split('/')[-2]

In [68]:
train=pd.read_csv(url_train,index_col=0)
train.index=pd.to_datetime(train.index)
test=pd.read_csv(url_test,index_col=0)
test.index=pd.to_datetime(test.index)
train.head()

Unnamed: 0,AskPrice,BidPrice,AskVolume,BidVolume
2020-12-21 09:00:00.500,4445.0,4419.0,467.0,87.0
2020-12-21 09:00:01.000,4445.0,4444.0,88.0,67.0
2020-12-21 09:00:01.500,4433.0,4420.0,6.0,462.0
2020-12-21 09:00:02.000,4430.0,4428.0,40.0,9.0
2020-12-21 09:00:02.500,4427.0,4426.0,1.0,81.0


This is high-frequency historical data (one snapshot every 0.5 seconds) for a commodity future traded on the Shanghai Futures Exchange. AskPrice is the lowest price at which a seller is willing to sell, and BidPrice is the highest price a buyer is willing to pay. AskVolume and BidVolume are the quantities offered by sellers and buyers at those prices. Some participants are not patient: for example, a seller may lower the price and trade with buyers immediately. Therefore, the ask/bid prices and volumes change over time. In this project, we use these four fields to forecast the direction of the market. Because this is a forecasting project, do not be surprised if the adjusted $R^2$ is low.

## Task 1: Build Features (70%)
For both the training and test data, build the predictors and the response.

#### mid price

First, build the mid price:
$$Mid_t=\frac{AskPrice_t+BidPrice_t}{2}$$

In [69]:
train['Mid_t'] = (train['AskPrice'] + train['BidPrice']) / 2

In [70]:
test['Mid_t'] = (test['AskPrice'] + test['BidPrice']) / 2

#### Y variable

The frequency of the data is 0.5 seconds, and we want to predict the price change over the next 5 seconds (10 observations). Build the Y variable
$$
Y_t=\frac{Mid_{t+1}+Mid_{t+2}+\ldots+Mid_{t+10}}{10}-Mid_t
$$
That is, we calculate the difference between the average mid price over the next 5 seconds and the current mid price.

In [71]:
train['Y'] = train['Mid_t'].shift(-10).rolling(10).mean() - train['Mid_t']

In [72]:
test['Y'] = test['Mid_t'].shift(-10).rolling(10).mean() - test['Mid_t']
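
As a quick sanity check on the `shift(-10).rolling(10).mean()` construction above, the sketch below compares it with a direct evaluation of the definition of $Y_t$ on a toy series. The series `mid` and the index `t` are hypothetical and are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy mid-price series (hypothetical values, not from the data set).
mid = pd.Series(np.arange(30, dtype=float) ** 2)

# Same construction as in the cells above.
y = mid.shift(-10).rolling(10).mean() - mid

# Direct evaluation of the definition at a single time step t:
# mean of Mid_{t+1}, ..., Mid_{t+10} minus Mid_t.
t = 12
y_direct = mid.iloc[t + 1 : t + 11].mean() - mid.iloc[t]

print(y.iloc[t], y_direct)
print(y.notna().sum(), "non-missing values out of", len(y))
```

Both numbers agree, so the one-liner matches the formula. Note that the last 10 rows are missing because no future prices exist for them, and the first 9 rows are also missing because the rolling window is not yet full there.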

Next, we build the predictors.

### Volume Order Imbalance

First, build
$$
OI_t=\delta V_t^B-\delta V_t^A
$$

where $\delta V_t^B$ measures the pressure on the buying side and $\delta V_t^A$ measures the pressure on the selling side.

$$
\delta V_t^B=\left\{\begin{array}{cl}0, & \mbox { if } P_t^B<P_{t-1}^B\\V_t^B-V_{t-1}^B, & \mbox { if } P_t^B=P_{t-1}^B\\V_{t}^B, & \mbox { if } P_t^B>P_{t-1}^B\end{array}\right.
$$
and 
$$
\delta V_t^A=\left\{\begin{array}{cl}V_t^A, & \mbox { if } P_t^A<P_{t-1}^A\\V_t^A-V_{t-1}^A, & \mbox { if } P_t^A=P_{t-1}^A\\0, & \mbox { if } P_t^A>P_{t-1}^A\end{array}\right.
$$

where $V_t^A,V_t^B$ are the ask and bid volumes, and $P_t^A,P_t^B$ are the ask and bid prices.

Using the shift(2), shift(4), ... operators, we build $OI_{t-2},OI_{t-4},OI_{t-6},OI_{t-8}, OI_{t-10}$, for a total of 6 predictors including $OI_t$.

#### $OI_t$

##### Method 1: using **Series.where**

The `map` call first encodes the three price scenarios as sentinel values, and two chained `where` calls then replace each sentinel with the corresponding volume term. The sentinel for the "price unchanged" case is a decimal (0.5 or -0.5) rather than an integer, so that it is not confused with a genuine volume such as 1.

In [73]:
train['AskCompare'] = train['AskPrice'].diff().map(lambda x: 0 if (x > 0) else -0.5 if (x == 0) else -2)
train['AskCompare'] = train['AskCompare'].where(train['AskCompare'] >= -1, train['AskVolume']).where(train['AskCompare'] != -0.5, train['AskVolume'].diff())
train['BidCompare'] = train['BidPrice'].diff().map(lambda x: 2 if (x > 0) else 0.5 if (x == 0) else 0)
train['BidCompare'] = train['BidCompare'].where(train['BidCompare'] <= 1, train['BidVolume']).where(train['BidCompare'] != 0.5, train['BidVolume'].diff())
train['OI_t'] = train['BidCompare'] - train['AskCompare']
train = train.drop(columns=(['AskCompare','BidCompare']))

In [74]:
test['AskCompare'] = test['AskPrice'].diff().map(lambda x: 0 if (x > 0) else -0.5 if (x == 0) else -2)
test['AskCompare'] = test['AskCompare'].where(test['AskCompare'] >= -1, test['AskVolume']).where(test['AskCompare'] != -0.5, test['AskVolume'].diff())
test['BidCompare'] = test['BidPrice'].diff().map(lambda x: 2 if (x > 0) else 0.5 if (x == 0) else 0)
test['BidCompare'] = test['BidCompare'].where(test['BidCompare'] <= 1, test['BidVolume']).where(test['BidCompare'] != 0.5, test['BidVolume'].diff())
test['OI_t'] = test['BidCompare'] - test['AskCompare']
test = test.drop(columns=(['AskCompare','BidCompare']))
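
As a cross-check of the sentinel approach, the piecewise definitions can also be written with `np.select`, which takes one condition per case. This is only a sketch: the helper name `order_imbalance` is hypothetical, and the input is the five rows printed by `train.head()` above.

```python
import numpy as np
import pandas as pd

# First five rows of the training data, copied from the train.head() output.
df = pd.DataFrame({
    "AskPrice":  [4445.0, 4445.0, 4433.0, 4430.0, 4427.0],
    "BidPrice":  [4419.0, 4444.0, 4420.0, 4428.0, 4426.0],
    "AskVolume": [467.0, 88.0, 6.0, 40.0, 1.0],
    "BidVolume": [87.0, 67.0, 462.0, 9.0, 81.0],
})

def order_imbalance(df):
    """OI_t = delta V^B_t - delta V^A_t, following the piecewise definitions."""
    d_bid = df["BidPrice"].diff().to_numpy()
    d_ask = df["AskPrice"].diff().to_numpy()
    zeros = np.zeros(len(df))
    # Bid side: 0 if the bid fell, volume change if unchanged, full volume if it rose.
    delta_b = np.select(
        [d_bid < 0, d_bid == 0, d_bid > 0],
        [zeros, df["BidVolume"].diff().to_numpy(), df["BidVolume"].to_numpy()],
        default=np.nan,  # first row: no previous quote to compare with
    )
    # Ask side: full volume if the ask fell, volume change if unchanged, 0 if it rose.
    delta_a = np.select(
        [d_ask < 0, d_ask == 0, d_ask > 0],
        [df["AskVolume"].to_numpy(), df["AskVolume"].diff().to_numpy(), zeros],
        default=np.nan,
    )
    return pd.Series(delta_b - delta_a, index=df.index, name="OI_t")

oi = order_imbalance(df)
print(oi.tolist())
```

For the second to fifth rows this gives 446, -6, -31 and -1, which agree with the `OI_t_10` values shown in the training table further below. The first row is left missing here, whereas the sentinel code above assigns it a value (the `NaN` price difference falls into the `else` branch of the lambda); which behaviour is preferable is a design choice.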

##### Method 2: define function
Running time: several minutes, because the loop recomputes `shift(1)` for every row. The code is kept commented out for reference.

In [75]:
# def deltaB(VB,PB):
#   delta_B = []
#   for i in range(len(VB)):
#     if PB[i] < PB.shift(1)[i]:
#       delta_B.append(0)
#     elif PB[i] == PB.shift(1)[i]:
#       delta_B.append(VB[i] - VB.shift(1)[i])
#     else:
#       delta_B.append(VB[i])
#   return delta_B

# def deltaA(VA,PA):
#   delta_A = []
#   for i in range(len(VA)):
#     if PA[i] < PA.shift(1)[i]:
#       delta_A.append(VA[i])
#     elif PA[i] == PA.shift(1)[i]:
#       delta_A.append(VA[i] - VA.shift(1)[i])
#     else:
#       delta_A.append(0)
#   return delta_A

# delta_A = deltaA(train['AskVolume'],train['AskPrice'])
# delta_B = deltaB(train['BidVolume'],train['BidPrice'])
# delta_A_test = deltaA(test['AskVolume'],test['AskPrice'])
# delta_B_test = deltaB(test['BidVolume'],test['BidPrice'])
# train['OI_t'] = np.array(delta_B) - np.array(delta_A)
# test['OI_t'] = np.array(delta_B_test) - np.array(delta_A_test)

##### Method 3: matrix calculation

\begin{equation}
  \delta V_t^B=
  \begin{bmatrix}
        I_{P_t^B<P_{t-1}^B} 
        & I_{P_t^B=P_{t-1}^B} &
        I_{P_t^B>P_{t-1}^B}
    \end{bmatrix}
    \begin{bmatrix}
        0 \\
        V_t^B-V_{t-1}^B \\
        V_t^B \\
    \end{bmatrix}
\end{equation}
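Stacking the indicator rows over all $t$ gives an $n\times 3$ matrix, and stacking the value columns gives a $3\times n$ matrix; the diagonal elements of their product are the $\delta V_t^B$ for each $t$.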
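
A minimal NumPy sketch of this idea for the bid side is given below, using the bid prices and volumes from the `train.head()` output. The variable names are hypothetical. Instead of forming the full $n\times n$ product and reading off its diagonal, `np.einsum` computes the row-wise dot products directly, which would matter for a data set with hundreds of thousands of rows.

```python
import numpy as np

# Bid prices and volumes for five consecutive ticks (from train.head()).
p_bid = np.array([4419.0, 4444.0, 4420.0, 4428.0, 4426.0])
v_bid = np.array([87.0, 67.0, 462.0, 9.0, 81.0])

dp = np.diff(p_bid)  # P_t - P_{t-1} for t = 1, ..., 4

# One row per tick: [I(price fell), I(price unchanged), I(price rose)].
indicators = np.column_stack([dp < 0, dp == 0, dp > 0]).astype(float)

# One row per tick: [0, V_t - V_{t-1}, V_t].
candidates = np.column_stack([np.zeros_like(dp), np.diff(v_bid), v_bid[1:]])

# Row-wise dot product, i.e. the diagonal of indicators @ candidates.T,
# computed without building the n x n matrix.
delta_b = np.einsum("ij,ij->i", indicators, candidates)

# On this tiny example the explicit diagonal can be checked directly.
assert np.array_equal(delta_b, np.diag(indicators @ candidates.T))
print(delta_b)
```

The result is 67, 0, 9, 0 for the four ticks that have a predecessor. The ask side would follow the same pattern with the value columns reordered to match its definition.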

#### $OI_{t-2},OI_{t-4},OI_{t-6},OI_{t-8}, OI_{t-10}$

In [76]:
train['OI_t_2'] = train['OI_t'].shift(2)
train['OI_t_4'] = train['OI_t'].shift(4)
train['OI_t_6'] = train['OI_t'].shift(6)
train['OI_t_8'] = train['OI_t'].shift(8)
train['OI_t_10'] = train['OI_t'].shift(10)

In [77]:
test['OI_t_2'] = test['OI_t'].shift(2)
test['OI_t_4'] = test['OI_t'].shift(4)
test['OI_t_6'] = test['OI_t'].shift(6)
test['OI_t_8'] = test['OI_t'].shift(8)
test['OI_t_10'] = test['OI_t'].shift(10)
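
The ten assignments above can also be written as one loop applied to both data frames. The helper `add_oi_lags` and the toy frame below are hypothetical and only illustrate the pattern.

```python
import numpy as np
import pandas as pd

def add_oi_lags(df, lags=(2, 4, 6, 8, 10)):
    """Return a copy of df with columns OI_t_2, OI_t_4, ... built from OI_t."""
    out = df.copy()
    for lag in lags:
        # shift(lag) moves values down by `lag` rows, so row t holds OI_{t-lag}.
        out[f"OI_t_{lag}"] = out["OI_t"].shift(lag)
    return out

# Toy frame (hypothetical values): OI_t is simply 0.0, 1.0, ..., 11.0.
toy = pd.DataFrame({"OI_t": np.arange(12.0)})
lagged = add_oi_lags(toy)

print(lagged.dropna())
```

Only the rows from position 10 onward survive `dropna()`, since `OI_t_10` is missing for the first 10 rows. Note also that `shift` counts rows, not clock time: if a timestamp is absent from the data (the printed training table appears to jump from 09:00:07.000 to 09:00:08.000), a lag of 10 rows spans slightly more than 5 seconds at that point.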

In [78]:
train = train.dropna()
train

Unnamed: 0,AskPrice,BidPrice,AskVolume,BidVolume,Mid_t,Y,OI_t,OI_t_2,OI_t_4,OI_t_6,OI_t_8,OI_t_10
2020-12-21 09:00:05.500,4422.0,4421.0,4.0,105.0,4421.5,-3.75,187.0,-55.0,-275.0,-1.0,-6.0,-467.0
2020-12-21 09:00:06.000,4422.0,4421.0,460.0,29.0,4421.5,-4.65,-532.0,-31.0,-9.0,-27.0,-31.0,446.0
2020-12-21 09:00:06.500,4422.0,4421.0,126.0,16.0,4421.5,-5.45,321.0,187.0,-55.0,-275.0,-1.0,-6.0
2020-12-21 09:00:07.000,4421.0,4420.0,861.0,593.0,4420.5,-5.00,-861.0,-532.0,-31.0,-9.0,-27.0,-31.0
2020-12-21 09:00:08.000,4419.0,4418.0,670.0,57.0,4418.5,-3.50,-670.0,321.0,187.0,-55.0,-275.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2021-01-05 14:59:53.000,4398.0,4397.0,353.0,231.0,4397.5,0.00,-326.0,2.0,-36.0,36.0,-15.0,-35.0
2021-01-05 14:59:53.500,4398.0,4397.0,375.0,241.0,4397.5,0.00,-12.0,-19.0,180.0,93.0,-16.0,-85.0
2021-01-05 14:59:54.000,4398.0,4397.0,370.0,370.0,4397.5,0.00,134.0,-326.0,2.0,-36.0,36.0,-15.0
2021-01-05 14:59:54.500,4398.0,4397.0,348.0,364.0,4397.5,0.00,16.0,-12.0,-19.0,180.0,93.0,-16.0


In [79]:
test = test.dropna()
test

Unnamed: 0,AskPrice,BidPrice,AskVolume,BidVolume,Mid_t,Y,OI_t,OI_t_2,OI_t_4,OI_t_6,OI_t_8,OI_t_10
2021-01-05 21:00:05.500,4406.0,4405.0,66.0,53.0,4405.5,1.75,53.0,-1.0,156.0,-1.0,103.0,-12.0
2021-01-05 21:00:06.000,4407.0,4405.0,110.0,129.0,4406.0,1.60,76.0,86.0,184.0,-97.0,-172.0,73.0
2021-01-05 21:00:06.500,4407.0,4406.0,16.0,97.0,4406.5,1.60,191.0,53.0,-1.0,156.0,-1.0,103.0
2021-01-05 21:00:07.000,4409.0,4407.0,117.0,27.0,4408.0,0.45,27.0,76.0,86.0,184.0,-97.0,-172.0
2021-01-05 21:00:07.500,4408.0,4406.0,51.0,179.0,4407.0,2.00,-51.0,191.0,53.0,-1.0,156.0,-1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2021-01-08 14:59:53.000,4487.0,4485.0,84.0,140.0,4486.0,0.15,17.0,92.0,49.0,-155.0,6.0,-18.0
2021-01-08 14:59:53.500,4487.0,4486.0,75.0,79.0,4486.5,-0.55,88.0,14.0,-162.0,-15.0,-37.0,122.0
2021-01-08 14:59:54.000,4488.0,4486.0,205.0,1.0,4487.0,-1.30,-78.0,17.0,92.0,49.0,-155.0,6.0
2021-01-08 14:59:54.500,4488.0,4487.0,198.0,54.0,4487.5,-1.90,61.0,88.0,14.0,-162.0,-15.0,-37.0


## Task 2: Make Multiple Linear Regression (30%)

Build multiple linear regression to predict $Y_t$ with $[OI_t, OI_{t-2},OI_{t-4},OI_{t-6},OI_{t-8}, OI_{t-10}]$ in training data. 

Please evaluate the performance of your models with statistical measures (MSE and adjusted $R^2$). You should not be surprised if the adjusted $R^2$ is very low.


In [80]:
model = smf.ols(formula='Y~OI_t+OI_t_2+OI_t_4+OI_t_6+OI_t_8+OI_t_10', data=train)
reg = model.fit()
print(reg.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     579.2
Date:                Thu, 15 Dec 2022   Prob (F-statistic):               0.00
Time:                        11:43:22   Log-Likelihood:            -3.1553e+05
No. Observations:              223206   AIC:                         6.311e+05
Df Residuals:                  223199   BIC:                         6.312e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0008      0.002     -0.380      0.7

The results suggest that the longer lags may have little impact on the model (the lags are measured in 0.5-second steps, so the longest, $OI_{t-10}$, is 5 seconds).

From the multiple linear regression results, the **Adj. R-squared** is 0.015, which means that only about 1.5% of the variance in $Y_t$ (the average price change over the next 5 seconds) is explained by the six order imbalance predictors. This is reasonable, because market prices are hard to predict; even a small amount of predictive power may be useful in a real market.
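
For reference, the adjusted $R^2$ follows the standard formula below, where $n$ is the number of observations and $p$ is the number of predictors. Plugging in the values from the summary output ($n=223206$, $p=6$, and the rounded $R^2=0.015$) suggests that the adjustment is negligible at this sample size, which is why both figures are reported as 0.015:

$$
\bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1} \approx 1-(1-0.015)\times\frac{223205}{223199} \approx 0.015
$$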

In [81]:
# The formula interface adds the intercept itself, so the raw predictor columns are enough.
predict = reg.predict(test[['OI_t','OI_t_2','OI_t_4','OI_t_6','OI_t_8','OI_t_10']])
predict

2021-01-05 21:00:05.500    0.037587
2021-01-05 21:00:06.000    0.043439
2021-01-05 21:00:06.500    0.174295
2021-01-05 21:00:07.000   -0.002348
2021-01-05 21:00:07.500   -0.092421
                             ...   
2021-01-08 14:59:53.000   -0.006846
2021-01-08 14:59:53.500    0.096995
2021-01-08 14:59:54.000   -0.085668
2021-01-08 14:59:54.500    0.038299
2021-01-08 14:59:55.000   -0.003977
Length: 122571, dtype: float64

In [82]:
mse = ((test['Y'] - predict) ** 2).sum() / (len(test) - 6 - 1)
mse

0.4029779644080539

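The MSE of the model on the test data is approximately 0.403 (computed with the degrees-of-freedom denominator $n-6-1$ rather than $n$).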
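
The sketch below contrasts the two possible denominators on simulated data: the plain mean of squared errors, which is what `sklearn.metrics.mean_squared_error` computes, and the degrees-of-freedom version used in the cell above. All arrays here are hypothetical and are not the notebook's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y_true = rng.normal(size=n)                          # hypothetical responses
y_pred = y_true + rng.normal(scale=0.5, size=n)      # hypothetical predictions

p = 6                                                # number of predictors in the model
sse = np.sum((y_true - y_pred) ** 2)                 # sum of squared errors

mse_plain = sse / n                                  # usual out-of-sample MSE
mse_df = sse / (n - p - 1)                           # denominator used in the cell above

print(mse_plain, mse_df, mse_df / mse_plain)
```

The ratio of the two is $n/(n-p-1)$, so with the 122571 test rows reported above the difference only appears around the fifth decimal place. For out-of-sample evaluation the plain version is the more common choice, since no parameters are fitted on the test data.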

## Reference 

1. [Paper about Order Imbalance](https://drive.google.com/file/d/1z5t2u1jKDtZjLBa_BZIDBY0DsRRhXQ-S/view?usp=sharing)