# Neural Network (Deep Correlation Model)

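- Overall idea: traditional approaches that use machine learning to optimize factor combinations usually frame the task as a regression problem, with a loss function of MSE-like form, that is,
$$Loss1=\sum_t \sum_i (y_{t+1,i} - f(\bm{x}_{t,i}))^2$$ 
A loss of this kind mostly fits the relationship between the levels (means) of $y_{t+1}$ and $f(\bm{x}_{t,i})$ and discards rank-order information. We therefore look for a learning target with a higher signal-to-noise ratio and set the optimization objective to the IC (information coefficient). In factor combination, the objective function for optimizing IC is
$$Loss2=-\sum_t Corr(y_{t+1}, f(\bm{x}_{t,i}))$$ 
where the correlation is taken across the stocks $i$ in period $t$. This loss treats each period's sample as a whole and uses the correlation over that whole cross-section as the loss, which gives a higher signal-to-noise ratio than a loss summed over individual samples. Because IC describes the correlation of the full sample, a local region can show the opposite pattern to the whole: for example, the overall IC may be positive while the IC among stocks with high factor values is negative. This can significantly hurt long-side stock selection. To better suit the **long-only stock selection** task, a weighted correlation coefficient can be used, with exponentially decaying weights assigned from the highest factor value ($i=1$) to the lowest ($i=n$):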

$$w_i=\left(\frac{1}{2}\right)^{\frac{i-1}{n-1}}, \quad i=1,\dots,n$$ 
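As a small illustration of the weighting scheme above, the following sketch assigns half-life weights to a handful of scores. The function name `halflife_weights` and the sample scores are hypothetical, and the normalization to a unit sum is an assumption made so that the weights can be used directly in weighted averages.

```python
import numpy as np

def halflife_weights(scores):
    """Half-life weights by rank: the top score gets twice the weight of the bottom one."""
    scores = np.asarray(scores, dtype=float).ravel()
    n = scores.size
    if n == 1:
        return np.ones(1)
    # Double argsort gives ascending ranks (0 = smallest score).
    ranks = np.argsort(np.argsort(scores))
    # Position i = 1 is the highest score, i = n the lowest.
    position = n - ranks
    w = 0.5 ** ((position - 1) / (n - 1))
    return w / w.sum()  # normalize so the weights sum to one

scores = np.array([0.3, -1.2, 2.0, 0.7, -0.1])  # illustrative factor values
w = halflife_weights(scores)
print(np.round(w, 4))
```

The highest score (index 2) receives the largest weight and the lowest score (index 1) exactly half of it.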

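With these weights, normalized to sum to one so that the sums below are weighted averages, the Weighted IC is computed as follows: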
$$\mathbb{E}[x|w]=\sum_i w_i x_i, \quad \mathbb{E}[y|w]=\sum_i w_i y_i$$ 
$$Var[x|w]=\mathbb{E}[x^2|w]-\mathbb{E}[x|w]^2=\sum_i w_i x_i^2-\left(\sum_i w_i x_i\right)^2$$ 
$$Var[y|w]=\mathbb{E}[y^2|w]-\mathbb{E}[y|w]^2=\sum_i w_i y_i^2-\left(\sum_i w_i y_i\right)^2$$ 
$$Cov(x,y|w)=\sum_i w_i x_i y_i - \left(\sum_i w_i x_i\right)\left(\sum_i w_i y_i\right)$$ 
$$Corr(x,y|w)=\frac{Cov(x,y|w)}{\sqrt{Var[x|w]Var[y|w]}}$$ 
This makes the model pay more attention to the correlation at the head of the distribution (where factor values are high).
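The formulas above can be checked numerically with a short NumPy sketch. The function name `weighted_corr` and the simulated data are assumptions for illustration only. With equal weights the expression should reduce to the ordinary Pearson correlation, which gives a convenient sanity check.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Weighted correlation Corr(x, y | w); w is rescaled to sum to one."""
    x, y, w = (np.asarray(a, dtype=float).ravel() for a in (x, y, w))
    w = w / w.sum()
    mean_x, mean_y = np.sum(w * x), np.sum(w * y)
    var_x = np.sum(w * x**2) - mean_x**2
    var_y = np.sum(w * y**2) - mean_y**2
    cov = np.sum(w * x * y) - mean_x * mean_y
    return cov / np.sqrt(var_x * var_y)

rng = np.random.default_rng(0)
x = rng.normal(size=200)              # simulated factor values
y = 0.5 * x + rng.normal(size=200)    # simulated next-period returns
uniform = np.ones_like(x)

# Equal weights: the weighted formula coincides with Pearson correlation.
print(np.isclose(weighted_corr(x, y, uniform), np.corrcoef(x, y)[0, 1]))
```

Replacing `uniform` with rank-based half-life weights tilts the statistic toward the stocks with the highest factor values.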

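- Network structure: 
Because the inputs are cross-sectional data, and to limit overfitting, we use a three-layer network similar to a multilayer perceptron. Each layer consists of a fully connected layer (64, 128, and 64 units respectively) with ReLU activation, followed by a batch normalization layer. A single linear output unit produces the combined factor value.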

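- When computing the loss during training, the model outputs $\hat{y}_{t+1,i}=f(\bm{x}_{t,i})$ are first ranked, the ranks determine the weights, and the negative of the weighted IC defined above is used as the loss. The weights are treated as constants, so gradients flow only through the predicted values, and the parameters are updated by backpropagation.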
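To give a sense of the size of the network described above, the following sketch counts its parameters by hand. It assumes 32 input factors (the value of `input_size` used in this notebook), a single linear output unit, and Keras-style batch normalization, which keeps two trainable parameters (scale and shift) and two non-trainable moving statistics per feature. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    # (trainable gamma/beta, non-trainable moving mean/variance)
    return 2 * n_features, 2 * n_features

n_inputs = 32            # assumed number of input factors
hidden = [64, 128, 64]   # hidden layer widths from the text

trainable = 0
non_trainable = 0
prev = n_inputs
for width in hidden:
    trainable += dense_params(prev, width)
    bn_trainable, bn_fixed = batchnorm_params(width)
    trainable += bn_trainable
    non_trainable += bn_fixed
    prev = width
trainable += dense_params(prev, 1)  # final linear output unit

print(trainable, non_trainable, trainable + non_trainable)
# 19265 512 19777
```

With roughly twenty thousand parameters, the model is small relative to a year of daily cross-sections, which is in line with the stated aim of limiting overfitting.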

In [None]:
import numpy as np
import pandas as pd

# alphas = pd.read_csv("data_residbarrarsector.csv")
alphas = pd.read_csv("data_cutnorm.csv")
# barra = pd.read_hdf('barrar_risk.h5')
base_data = pd.read_csv("base_data.csv").dropna(subset=['adj_ret_p1'])
data = pd.merge(alphas, base_data, on=['date', 'cn_code'], how='inner')

In [None]:
alphas['year'] = alphas['date'] // 10000
data = pd.merge(alphas, base_data, on=['date', 'cn_code'], how='inner')
data['year'] = data['date'] // 10000
alpha_cols = alphas.columns.drop(['cn_code', 'date'])

In [None]:
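from keras.layers import Flatten, Dense, Input
from keras.models import Model
from keras import backend as K
from keras.optimizers import Adam
import tensorflow as tf
from tensorflow import convert_to_tensor

# Half-life weighting: the highest prediction gets twice the weight of the lowest
def get_halflife_weights(y_pred):
    # Flatten (batch, 1) -> (batch,); otherwise argsort runs along the size-1 axis
    # and every rank comes out as 0. Requires eager execution (run_eagerly=True).
    y_pred = np.asarray(y_pred).ravel()
    l = y_pred.shape[0]
    # Ascending weights from 0.5 to 1; max(l - 1, 1) guards a single-sample batch
    weights = np.power(0.5, np.arange(l - 1, -1, -1) / max(l - 1, 1))
    y_pred_ranks = np.argsort(np.argsort(y_pred)) # double argsort gives ranks (0 = smallest)
    weights = weights[y_pred_ranks] / weights.sum() # assign by rank, then normalize
    weights = weights.astype('float32')
    return convert_to_tensor(weights)

# Weighted IC loss, implemented with the Keras backend
def weighted_ic(y_true, y_pred):
    # Flatten both tensors so element-wise products line up with the 1-D weights
    y_true = tf.reshape(y_true, [-1])
    y_pred = tf.reshape(y_pred, [-1])
    weights = get_halflife_weights(y_pred)
    mean_true = K.sum(y_true * weights)
    mean_pred = K.sum(y_pred * weights)

    var_true = K.sum(K.square(y_true) * weights) - K.square(mean_true)
    var_pred = K.sum(K.square(y_pred) * weights) - K.square(mean_pred)

    cov = K.sum(weights * y_true * y_pred) - mean_true * mean_pred
    corr = cov / (K.sqrt(var_pred) * K.sqrt(var_true))
    return -corr # negate so that minimizing the loss maximizes the weighted IC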

In [None]:
from keras.layers import Flatten, Dense, Input, BatchNormalization
from keras.models import Model
from keras import backend as K
from keras.optimizers import Adam
K.clear_session() # clear any previously trained models

# Model architecture
def MODEL():
    input_size = 32 # number of input factors (32 in this study)
    input_layer = Input(shape=(input_size,)) # input layer
    x = input_layer # start from the input layer
    x = Dense(64, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dense(128, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Dense(64, activation='relu')(x)
    x = BatchNormalization()(x)
    x = Flatten()(x) # flatten the tensor (a no-op here, x is already 2-D)
    x = Dense(1)(x)
    output_layer = x # output layer
    model = Model(input_layer, output_layer) # assemble the model
    # model.summary() # print model details
    return model

In [None]:
# Rolling training, one year at a time
feature_cols = alpha_cols.drop('year', errors='ignore') # 'year' is a helper column, not a factor
pred_list = []
for train_year in range(2015, 2023):
    print("For Train Year {:d}".format(train_year))
    # Split by calendar year: train on train_year, test on the following year
    train_mask = data.year == train_year
    test_mask = data.year == train_year + 1
    x_train = data.loc[train_mask, feature_cols].astype('float32')
    y_train = data.loc[train_mask, 'adj_ret_p1'].astype('float32')

    x_test = data.loc[test_mask, feature_cols].astype('float32')
    y_test = data.loc[test_mask, 'adj_ret_p1'].astype('float32')

    # Clear the previous model
    K.clear_session()
    # Build the model
    model = MODEL()
    # Train with the Adam optimizer; the loss is the negative Weighted IC.
    # run_eagerly=True is required because the loss calls NumPy on y_pred.
    model.compile(optimizer = Adam(0.0001),
              loss = weighted_ic,
              metrics = [weighted_ic],
              run_eagerly=True)
    # batch_size of 3000 is roughly the size of one day's stock universe
    # (fit shuffles rows by default, so a batch mixes dates); train for 5 epochs
    model.fit(x_train, y_train,
            validation_data = (x_test, y_test),
            batch_size = 3000,
            epochs = 5)

    # Out-of-sample predictions and their pooled IC over the test year
    y_pred_test = model.predict(x_test).ravel()
    print("IC For Predicted Alpha in {:d}: {:.4f}".format(train_year + 1, np.corrcoef(y_pred_test, y_test)[0][1]))
    # Store the results
    pred_list.append(y_pred_test)
    print("--------------------------------------------------------")
pred_results = np.hstack(pred_list)

For Train Year 2020
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
IC For Predicted Alpha in 2021: 0.0052
--------------------------------------------------------
For Train Year 2021
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
IC For Predicted Alpha in 2022: 0.0041
--------------------------------------------------------
For Train Year 2022
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
IC For Predicted Alpha in 2023: -0.0153
--------------------------------------------------------


Only the results for training years 2020-2022 are shown above; the results for the other years are similar.

In [None]:
pred_results

array([ 0.86695486,  0.32839933, -0.5326828 , ..., -2.0387452 ,
        0.15744704,  2.5687876 ], dtype=float32)

In [None]:
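# Attach dates and stock codes. Predictions were stacked by test year (2016-2023),
# so the keys are rebuilt year by year in the same order.
pred_results_df = pd.concat(
    [data.loc[data.year == year, ['date', 'cn_code']] for year in range(2016, 2024)],
    ignore_index=True)
pred_results_df['weighted_ic_nn'] = pred_results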

In [None]:
# Export the results
pred_results_df.to_hdf("alpha_aggregations_weighted_ic_nn.h5", key='stage', mode='w')