# STAT4012 Group Project
- This file is used to test the code for the project.
- It should include **clear comments** and **explanations** for each section of the code.
- If other modules need to be imported or run in this file, use `%load filename.py` to load the code from a file, or `%run filename.py` to run it.

In [13]:
# import libraries
# shared libraries are collected in this cell rather than repeated in later cells
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [14]:
# load the adjusted raw data, drop rows with any missing value, and sort by date
data = pd.read_excel('../data/raw_data_adjusted.xlsx',index_col=0).dropna(how='any')
data.sort_values(by='date',ascending=True,inplace=True)

## 1. Preprocessing

### 1.1 Calculate the sixth day's intraday return (used as the label)

In [15]:
# intraday return of the sixth day, (close - open) / open; the last five rows stay NaN
future_open = data['open'].shift(-5)
data['sixth_day_return'] = (data['close'].shift(-5) - future_open) / future_open
data['sixth_day_return']

date
2017-11-28    0.005632
2017-11-29    0.043850
2017-11-30   -0.002437
2017-12-01    0.001688
2017-12-04    0.045387
                ...   
2023-04-06         NaN
2023-04-10         NaN
2023-04-11         NaN
2023-04-12         NaN
2023-04-13         NaN
Name: sixth_day_return, Length: 1352, dtype: float64
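
The alignment produced by `shift(-5)` is easier to see on a small frame. The sketch below uses made-up prices (not the project data) to illustrate that the label on row *t* is the intraday return of row *t+5*, and that the last five rows have no label.

```python
import pandas as pd

# Hypothetical prices, chosen only to make the arithmetic easy to follow.
toy = pd.DataFrame({
    "open":  [10.0, 10.0, 10.0, 10.0, 10.0, 20.0, 40.0],
    "close": [10.0, 10.0, 10.0, 10.0, 10.0, 21.0, 38.0],
})

# Intraday return of each day: (close - open) / open.
intraday = (toy["close"] - toy["open"]) / toy["open"]

# The label for row t is the intraday return five rows later (the "sixth day").
toy["sixth_day_return"] = intraday.shift(-5)

print(toy)
```

Row 0 receives the return of row 5 (1/20) and row 1 receives the return of row 6 (-2/40); the remaining rows are NaN because no sixth day exists for them.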

### 1.2 Mark golden-cross and death-cross points

In [16]:
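# +1 when MA_5 is above MA_25, -1 when below, 0 when equal
data['diff'] = np.sign(data["MA_5"] - data["MA_25"])
# +1 when MA_5 moves up relative to MA_25 (golden cross), -1 when it moves down (death cross)
data['signal'] = np.sign(data['diff'] - data['diff'].shift(1))
data['golden_cross'] = data['signal'].map({1:1,0:0,-1:0})
data['death_cross'] = data['signal'].map({-1:1,0:0,1:0})
# drop the helper columns, then drop rows with NaN (the first row and the unlabeled last five)
data = data.drop(columns=['diff','signal']).dropna(how='any')
print(data[['golden_cross','death_cross']])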

            golden_cross  death_cross
date                                 
2017-11-29           0.0          0.0
2017-11-30           0.0          0.0
2017-12-01           0.0          0.0
2017-12-04           0.0          0.0
2017-12-05           0.0          0.0
...                  ...          ...
2023-03-30           0.0          0.0
2023-03-31           0.0          0.0
2023-04-03           0.0          0.0
2023-04-04           0.0          0.0
2023-04-05           0.0          0.0

[1346 rows x 2 columns]
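
Because crossings are rare, the printed head and tail show only zeros. As a small check of the logic, the sketch below applies the same sign-change idea to hypothetical moving averages in which `MA_5` rises above `MA_25` once and falls below it once. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical moving averages: MA_5 rises above MA_25, then falls back below.
toy = pd.DataFrame({
    "MA_5":  [1.0, 2.0, 4.0, 5.0, 3.0, 1.0],
    "MA_25": [3.0, 3.0, 3.0, 3.0, 3.5, 3.0],
})

side = np.sign(toy["MA_5"] - toy["MA_25"])   # +1 above, -1 below, 0 equal
change = np.sign(side.diff())                # +1 on an upward flip, -1 on a downward flip

toy["golden_cross"] = (change == 1).astype(int)
toy["death_cross"] = (change == -1).astype(int)
print(toy)
```

Unlike the `map` version above, the comparison leaves the first row as 0 rather than NaN, so no row has to be dropped for this step.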


### 1.3 Feature scaling: min-max normalization, since extreme values are rare

In [17]:
# min-max scale every column (features and label alike) to the range [0, 1]
data = data.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
print(data)

                open     close      high       low  daily_trading_value  \
date                                                                      
2017-11-29  0.022734  0.021534  0.021775  0.021029             0.013371   
2017-11-30  0.021276  0.021753  0.020565  0.021589             0.004283   
2017-12-01  0.020755  0.021365  0.020502  0.021675             0.004137   
2017-12-04  0.020932  0.021142  0.020161  0.020924             0.007211   
2017-12-05  0.020180  0.020891  0.020117  0.020990             0.004788   
...              ...       ...       ...       ...                  ...   
2023-03-30  0.459460  0.460630  0.459853  0.463660             0.139262   
2023-03-31  0.464342  0.491230  0.485870  0.470718             0.226156   
2023-04-03  0.470301  0.459349  0.473184  0.458024             0.215995   
2023-04-04  0.463816  0.453847  0.463372  0.453250             0.157554   
2023-04-05  0.446778  0.436110  0.443313  0.436595             0.16119
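
The cell above computes the minimum and maximum over the whole data set, so the scaled training rows depend on values from the later test period. One possible alternative, sketched here with hypothetical helper names (`fit_min_max`, `apply_min_max`) and made-up numbers, is to take the statistics from the training rows only and reuse them for the test rows.

```python
import numpy as np

def fit_min_max(train):
    """Return per-column minimum and range computed on the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Scale values with statistics obtained from fit_min_max."""
    return (values - col_min) / col_range

# Hypothetical feature matrix: 6 rows, 2 columns; the first 4 rows are "training" data.
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [5.0, 50.0],
                     [6.0, 40.0], [4.0, 60.0]])
train, test = features[:4], features[4:]

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled)
print(test_scaled)
```

With this approach, test values may fall outside [0, 1], which is expected when the test period contains prices above the training maximum.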

## 2. Feature Engineering

### PCA 
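
This section is still a placeholder. As one possible starting point, the sketch below computes principal components with a plain NumPy SVD on synthetic data. The function name `pca_svd` and the data are assumptions for illustration, not part of the project code.

```python
import numpy as np

def pca_svd(features, n_components):
    """Project features onto their first n_components principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Hypothetical data: two almost collinear columns plus one independent noise column.
features = np.hstack([base,
                      2 * base + 0.01 * rng.normal(size=(200, 1)),
                      0.1 * rng.normal(size=(200, 1))])

scores, explained = pca_svd(features, n_components=2)
print(scores.shape, explained)
```

Highly correlated inputs such as open, close, high, and low are the kind of columns for which the first component would be expected to carry most of the variance.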

## 3. Build Model

### 3.0 hyperparameter tuning

### 3.1 Convolutional layer & MLP (experimental group)

In [18]:
from numpy import array
from keras.models import Sequential
# Conv1D and MaxPooling1D are imported from keras.layers directly
from keras.layers import Conv1D, Dense, Flatten, MaxPooling1D

In [19]:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix - 1, -1]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

In [23]:
# choose a number of time steps
n_steps = 5
# convert into input/output; the target y is the last column of data (death_cross at this point)
X, y = split_sequences(data.to_numpy(), n_steps)
print(X.shape, y.shape)


(1342, 5, 20) (1342,)
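
The shapes follow from the windowing: a table with N rows yields N - n_steps + 1 windows (1346 - 5 + 1 = 1342 here), and every column except the last becomes a feature. The sketch below repeats the same windowing on a tiny made-up array so that the contents of each window can be inspected.

```python
import numpy as np

def split_sequences(sequences, n_steps):
    """Same windowing as above: the last column is the target, the rest are features."""
    X, y = [], []
    for i in range(len(sequences) - n_steps + 1):
        end_ix = i + n_steps
        X.append(sequences[i:end_ix, :-1])   # n_steps rows of features
        y.append(sequences[end_ix - 1, -1])  # target taken from the window's last row
    return np.array(X), np.array(y)

# Hypothetical table: 6 rows, two feature columns and one target column.
toy = np.array([[i, 10 * i, 100 * i] for i in range(6)])
X_toy, y_toy = split_sequences(toy, n_steps=3)
print(X_toy.shape, y_toy)
```

Each target comes from the same row as the last feature row of its window, so any look-ahead has to be built into the target column itself, as the `shift(-5)` in section 1.1 does.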


In [34]:
# split chronologically: the first 1000 windows for training, the rest for testing
X_train, X_test = X[:1000,:,:],X[1000:,:,:]
y_train, y_test = y[:1000],y[1000:]


In [26]:
n_features = X.shape[2]
n_features

20

In [27]:
# define model
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=2, activation='relu', input_shape=(n_steps, n_features)))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
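
To make the architecture easier to reason about, the sketch below works out the tensor lengths and parameter counts by hand, assuming the Keras defaults of `'valid'` padding, a convolution stride of 1, and a pooling stride equal to `pool_size`. The helper names are hypothetical.

```python
def conv1d_output_steps(n_steps, kernel_size, stride=1):
    """Length of a 1-D convolution output with 'valid' padding."""
    return (n_steps - kernel_size) // stride + 1

def max_pool_output_steps(n_steps, pool_size):
    """Length after max pooling with 'valid' padding and stride equal to pool_size."""
    return (n_steps - pool_size) // pool_size + 1

n_steps, n_features, filters = 5, 20, 64

after_conv = conv1d_output_steps(n_steps, kernel_size=2)      # 4 time steps
after_pool = max_pool_output_steps(after_conv, pool_size=2)   # 2 time steps
flattened = after_pool * filters                              # 128 values

# Trainable parameters: weights plus biases for each layer.
conv_params = (2 * n_features + 1) * filters
dense_50_params = (flattened + 1) * 50
dense_1_params = 50 + 1
total_params = conv_params + dense_50_params + dense_1_params
print(after_conv, after_pool, flattened, total_params)
```

Under these assumptions the model has 9,125 trainable parameters, which can be compared against `model.summary()`.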

In [35]:
# fit model
model.fit(X_train, y_train, epochs=1000, verbose=1)

<keras.callbacks.History at 0x1b8f5917ec8>

In [38]:
# predict on the test set
y_pred = model.predict(X_test, verbose=1)
print(y_pred)

[[ 2.19094604e-02]
 [ 6.60218298e-03]
 [ 8.89711082e-03]
 [-6.58150017e-03]
 [ 4.60146368e-03]
 [-2.41778791e-03]
 [ 6.22861087e-03]
 [ 2.68338621e-03]
 [ 4.71566394e-02]
 [ 5.85018098e-03]
 [ 1.37078911e-02]
 [-7.33311474e-03]
 [-1.14382803e-03]
 [ 6.49707019e-03]
 [ 9.41607356e-03]
 [ 1.47939026e-02]
 [ 1.66282505e-02]
 [ 3.15901726e-01]
 [ 1.07730165e-01]
 [ 4.13168967e-03]
 [-3.86161357e-03]
 [ 8.12457502e-03]
 [-1.71574205e-03]
 [-6.11044466e-04]
 [ 1.88579977e-01]
 [ 6.38701245e-02]
 [ 1.75325125e-02]
 [-1.45632327e-01]
 [-1.44555718e-02]
 [ 8.26077908e-03]
 [-4.12901491e-03]
 [ 3.35633755e-04]
 [-7.23157078e-03]
 [-2.76416540e-06]
 [-3.80207598e-03]
 [ 1.81714520e-02]
 [ 1.14639103e-03]
 [ 2.77668238e-04]
 [ 4.22768712e-01]
 [-3.17627192e-03]
 [-6.53620064e-03]
 [-4.41937149e-03]
 [ 1.42366439e-03]
 [-5.82886487e-03]
 [-3.58064473e-03]
 [-7.66374171e-03]
 [-8.75477493e-03]
 [-1.26308352e-02]
 [ 1.66125804e-01]
 [ 7.23407626e-01]
 [ 8.72895867e-02]
 [-8.94589722e-03]
 [ 2.9459744

#### Model evaluation

In [42]:
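from keras import losses
# Recompile to report loss and accuracy on the test set.
# Categorical cross-entropy expects one-hot class labels, which this single-output
# model does not produce, so the loss value below should not be read as a quality measure.
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])
model.evaluate(X_test,y_test)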



[7.668434776064714e-09, 0.9415204524993896]
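
Since the network was trained with an MSE loss on a single output, regression metrics are a more natural way to read its predictions. The sketch below computes MSE, MAE, and the share of predictions with the correct sign, using a hypothetical helper `regression_report` and made-up numbers in place of `y_test` and `model.predict(X_test)`. The sign measure is only informative when the target is an unscaled return.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, MAE, and sign agreement for 1-D targets and (n, 1) predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_pred - y_true
    return {
        "mse": float(np.mean(errors ** 2)),
        "mae": float(np.mean(np.abs(errors))),
        # Share of samples where prediction and target have the same sign.
        "sign_agreement": float(np.mean(np.sign(y_pred) == np.sign(y_true))),
    }

# Hypothetical values standing in for y_test and model.predict(X_test).
y_true_toy = [0.01, -0.02, 0.03, -0.01]
y_pred_toy = [[0.02], [-0.01], [-0.01], [-0.01]]
report = regression_report(y_true_toy, y_pred_toy)
print(report)
```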

### 3.2 Attention-based LSTM model (experimental group)
Structure: data -> LSTM layer -> attention layer -> dense layer -> prediction
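
The notebook does not yet say which attention variant is intended. As a rough illustration of the attention step only, the NumPy sketch below scores each LSTM hidden state with a dot product against a vector, turns the scores into softmax weights, and averages the hidden states with those weights. The function name, the scoring rule, and the random inputs are all assumptions.

```python
import numpy as np

def attention_pool(hidden_states, score_vector):
    """Weighted average of hidden states over time.

    hidden_states: array of shape (n_steps, n_units)
    score_vector:  array of shape (n_units,), a learned parameter in a real model
    """
    scores = hidden_states @ score_vector          # one score per time step
    scores = scores - scores.max()                 # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ hidden_states              # shape (n_units,)
    return context, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))   # hypothetical LSTM outputs: 5 steps, 8 units
v = rng.normal(size=8)             # hypothetical scoring vector
context, weights = attention_pool(hidden, v)
print(weights, context.shape)
```

In the full model, the context vector would be passed to the dense layer in place of the flattened LSTM output.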

### 3.3 Simple LSTM (control group 1)

### 3.4 ARIMA model (control group 2)

## 4. Model Evaluation

### 4.1 MSE, scores
### 4.2 Back-testing
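
The back-testing rule has not been specified yet. One simple possibility, sketched below under stated assumptions, is a long-only strategy: buy at the open and sell at the close of the sixth day whenever the predicted return exceeds a threshold, and stay out otherwise. The sketch ignores transaction costs and slippage, and the function name and numbers are hypothetical.

```python
import numpy as np

def backtest_long_only(predicted, realised, threshold=0.0):
    """Equity curve for holding the asset only when the prediction exceeds threshold."""
    positions = (np.asarray(predicted, dtype=float) > threshold).astype(float)
    strategy_returns = positions * np.asarray(realised, dtype=float)
    # Compound the per-trade returns, starting from an equity of 1.0.
    return np.cumprod(1.0 + strategy_returns)

# Hypothetical predicted and realised intraday returns.
predicted_toy = [0.01, -0.02, 0.03]
realised_toy = [0.10, -0.50, -0.10]
equity = backtest_long_only(predicted_toy, realised_toy)
print(equity)
```

The realised returns would need to be unscaled intraday returns, not the min-max scaled label from section 1.3.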