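Transformers and their derivative architectures have achieved outstanding results in natural language processing. When the same paradigm is carried over to time series forecasting, however, it runs into an awkward problem: it often fails to beat simple linear models.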

We start by looking at the main differences between NLP (natural language processing) and TSF (time series forecasting):

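1. In natural language, meaning lives both in the individual words and in the order in which they appear. A sentence whose words are completely shuffled still retains part of its information, though less precisely. A language learner who knows no grammar can still communicate, barely, using words alone. For a time series, shuffling the order destroys nearly all of the information: almost everything a time series has to say is encoded in its ordering.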

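2. Natural language is highly consistent and transferable. Common words and phrases mean roughly the same thing across the vast majority of corpora; polysemous words exist but are a minority. Different time series, by contrast, can show the same shape without carrying the same meaning. In finance, for example, certain price patterns carry information about the trend because they arise from the contest between long and short positions. If the same pattern shows up in a temperature series, it says nothing about a trend, because the underlying mechanism is entirely different.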

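3. Training data for natural language is abundant: human history has accumulated an enormous corpus. Because time series from different domains do not share meaning, the data available in each domain is limited to what falls inside the framework being studied. For a typical asset, more than 10 years of data already counts as a very rich history, and extending the series further back runs into changes in market structure and regime.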

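4. Patterns in natural language drift very slowly, slowly enough to be ignored. Human language does evolve, but over decades; in the short term only the meanings of a few words change, and the overall grammar stays put. In financial data, by contrast, concept drift is common. Shifts in the underlying drivers of a series, such as the subprime mortgage crisis or the pandemic, can change an asset's behaviour completely, which sharply reduces the value of historical data and makes the data shortage worse.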
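Point 1 can be checked numerically. The sketch below is an illustration on synthetic data, not part of the original experiment: it simulates an AR(1) series (the coefficient 0.9 and the length 5000 are arbitrary assumptions), shuffles it, and compares the lag-1 autocorrelation before and after. The values are the same in both cases, yet the temporal structure is gone once the order is lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(1) series: x_t = 0.9 * x_{t-1} + noise (illustrative, not market data).
n = 5000
noise = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + noise[t]


def lag1_autocorr(series):
    """Pearson correlation between the series and itself shifted by one step."""
    return float(np.corrcoef(series[:-1], series[1:])[0, 1])


# Same values, random order.
shuffled = rng.permutation(x)

print(f"original lag-1 autocorrelation: {lag1_autocorr(x):.3f}")
print(f"shuffled lag-1 autocorrelation: {lag1_autocorr(shuffled):.3f}")
print(f"same mean after shuffling: {np.isclose(x.mean(), shuffled.mean())}")
```

The mean (like any other order-free statistic) survives the shuffle, which is why "nearly all" rather than "all" of the information is lost.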

The reasons the Transformer architecture succeeds in NLP are exactly the reasons it cannot be transferred directly to TSF:

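1. The sequential structure of an RNN hampers the propagation of long-range information: over long distances the gradient either vanishes or explodes. To capture long-range dependencies, the Transformer gives up the RNN's sequential processing and instead handles all positions in parallel, so that distant information can propagate directly;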

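2. Because the parallel architecture discards order information, the Transformer adds positional encodings to restore it. The positional encoding, however, interferes with part of the original semantic information;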

3. Because the training corpus is large enough, the impact of the positional encoding can be minimized;

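4. The parallel architecture also makes full use of compute and greatly speeds up training, so deeper and more complex models become affordable. What was sacrificed is made up for by a larger model.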

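In other words, the Transformer shines in NLP because the training data is abundant enough to mask its weaknesses and let its strengths come through. In TSF the situation is reversed: the Transformer has no such advantage, and its weaknesses are amplified by the shortage of data. The architecture is complex to begin with, and the more parameters a model has, the more data it needs. A very large model can simply memorize an already small dataset and overfit, so scaling the model up must be approached with caution.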
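The claim that parallel attention loses order information can be verified directly. The following is a minimal sketch, assuming a single `nn.MultiheadAttention` layer with no positional encoding; the sizes and the permutation are arbitrary choices for illustration. Reordering the input time steps only reorders the output in the same way, whereas an RNN's final state changes with the ordering.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One self-attention layer without any positional encoding.
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True).eval()

x = torch.randn(1, 6, 8)                   # (batch, time steps, features)
perm = torch.tensor([3, 0, 5, 1, 4, 2])    # a fixed reordering of the 6 time steps

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permuting the input merely permutes the output: attention alone cannot see order.
attention_is_order_blind = torch.allclose(out[:, perm], out_perm, atol=1e-5)
print("attention output just gets permuted:", attention_is_order_blind)

# An RNN reads the steps one after another, so its final state depends on the order.
rnn = nn.RNN(input_size=8, hidden_size=8, batch_first=True).eval()
with torch.no_grad():
    _, h = rnn(x)
    _, h_perm = rnn(x[:, perm])

rnn_state_unchanged = torch.allclose(h, h_perm, atol=1e-5)
print("RNN final state unchanged by reordering:", rnn_state_unchanged)
```

This is why a positional encoding has to be added before attention, and why that encoding competes with the content signal when each input step carries little information.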

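Of course, this does not mean Transformers cannot be used at all. The Transformer still has an edge in extracting long-range dependencies. Concretely, to make full use of that edge we need to address the following: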

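1. Each information unit must carry enough information. In natural language every word is already semantically rich, and recent large language models use token embedding dimensions of 4096 or more. A single time step of OHLCV data has far too few dimensions; even with some auxiliary features added, a single step offers little useful information to exchange with other steps. The information unit therefore has to be raised from the time step to the sub-sequence. A 10-day sub-sequence, for example, carries not only the 10 days of prices but also some abstract trend information, such as a short rally on shrinking volume or a stretch of range-bound trading on heavy volume. By grouping several time steps into a patch, the model processes time segment by segment. The same structure also allows a longer history window, because the number of tokens shrinks and the O(N^2) cost of the attention layer shrinks with it.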

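2. Expand the training set in a principled way. If the goal is asset price prediction, the training set can at least be extended to other financial assets, but it should not be extended to non-financial domains, because what drives prices is supply and demand and the contest between longs and shorts. The model also needs additional mechanisms for understanding how assets differ from and relate to each other, such as volatility, correlation and cointegration.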
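The effect of patching on attention cost in point 1 is simple arithmetic. The sketch below uses two hypothetical helper functions and an assumed 400-step look-back window with the 10-day patch from the text; only the patch-count formula is taken from the notebook.

```python
def num_patches(seq_len: int, patch_size: int, stride: int) -> int:
    """Number of patches produced by a window of patch_size moved in steps of stride."""
    if seq_len < patch_size:
        raise ValueError("seq_len must be at least patch_size")
    return (seq_len - patch_size) // stride + 1


def attention_pairs(num_tokens: int) -> int:
    """Entries in one self-attention score matrix: every token attends to every token."""
    return num_tokens ** 2


seq_len = 400      # assumed look-back window, in time steps
patch_size = 10    # the 10-day sub-sequence used as the example in the text
stride = 10        # non-overlapping patches

tokens_raw = seq_len
tokens_patched = num_patches(seq_len, patch_size, stride)

print(tokens_raw, attention_pairs(tokens_raw))          # 400 160000
print(tokens_patched, attention_pairs(tokens_patched))  # 40 1600
print(attention_pairs(tokens_raw) // attention_pairs(tokens_patched))  # 100
```

With non-overlapping patches of size P the score matrix shrinks by a factor of roughly P^2, which is what makes a longer history window affordable.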



In [1]:
import os
os.chdir('d:/future/Index_Future_Prediction')

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.optim import lr_scheduler, Adam, AdamW
from torch.utils.data import TensorDataset, DataLoader

from utils import *

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [3]:
seq_len = 40
pred_len = 5
train_ratio = 0.5
validation_ratio = 0.2
test_ratio = 0.03

hidden_size = 10
num_layers = 1

In [None]:
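# Load the data. The full asset universe is kept for reference; this run uses only the three index futures.
# assets_list = ['IH.CFX', 'IF.CFX', 'IC.CFX', 'AU.SHF', 'JM.DCE','RB.SHF','HC.SHF', 'I.DCE', 'M.DCE', 'CF.ZCE',]
assets_list = ['IH.CFX', 'IF.CFX', 'IC.CFX',]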

feature_columns = ['inday_chg_open','inday_chg_high','inday_chg_low','inday_chg_close','inday_chg_amplitude', 'ma_10','ma_26','ma_45','ma_90','ma_vol',]
label_columns = ['label_return','down_prob','middle_prob','up_prob']

feature = []
label = []

for asset_code in assets_list:
    data = pd.read_csv(f'data/{asset_code}.csv')
    feature.append(torch.tensor(data[feature_columns].values, dtype = torch.float32, device = 'cuda:0'))
    label.append(torch.tensor(data[label_columns].values, dtype = torch.float32, device = 'cuda:0'))

# Stack the per-asset tensors: (time, asset, feature)
feature = torch.stack(feature, dim = 1)
label = torch.stack(label, dim = 1)
print(feature.shape, label.shape)

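# Unfold time into sliding windows of length seq_len: (num_windows, asset, seq_len, feature)
feature = feature.unfold(dimension = 0, size = seq_len, step = 1).transpose(2,3)
label = label[seq_len-1:]
# Merge the asset dimension into the sample dimension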
feature = torch.flatten(feature, start_dim=0, end_dim = 1)
label = torch.flatten(label, start_dim=0, end_dim = 1)

print(feature.shape, label.shape)
data = RandomLoader(feature, label)
train_loader, test_loader = data(batch_size=100, slice_size=[0.5,0.2], balance=[True, False])

torch.Size([2603, 3, 10]) torch.Size([2603, 3, 4])


(torch.Size([7692, 40, 10]), torch.Size([7692, 4]))
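The windowing step is easier to follow on a toy tensor. The sketch below runs on CPU with made-up sizes (10 time steps, 3 assets, 2 features, window length 4) and fills each entry with its own time index, so the contents of a window show which steps it covers. It assumes nothing about how the label columns in the CSV files were built; it only shows which label row each window is paired with.

```python
import torch

seq_len = 4                        # toy window length (the notebook uses 40)
T, n_assets, n_feat = 10, 3, 2     # toy sizes chosen for illustration

# feature[t, a, f] = t and label[t, a, 0] = t, so values identify time steps.
time_index = torch.arange(T, dtype=torch.float32).view(T, 1, 1)
feature = time_index.repeat(1, n_assets, n_feat)
label = time_index.repeat(1, n_assets, 1)

# unfold appends the window as the last dimension: (T - seq_len + 1, asset, feature, seq_len);
# the transpose moves it in front of the feature dimension.
windows = feature.unfold(dimension=0, size=seq_len, step=1).transpose(2, 3)
aligned_label = label[seq_len - 1:]

print(windows.shape)         # torch.Size([7, 3, 4, 2])
print(aligned_label.shape)   # torch.Size([7, 3, 1])

# Window i covers steps i .. i + seq_len - 1 and is paired with the label row of its last step.
print(windows[0, 0, :, 0])   # tensor([0., 1., 2., 3.])
print(torch.equal(aligned_label[:, 0, 0], windows[:, 0, -1, 0]))

# Merge (window, asset) into one sample axis; row k is window k // n_assets, asset k % n_assets.
flat = torch.flatten(windows, start_dim=0, end_dim=1)
print(flat.shape)            # torch.Size([21, 4, 2])
```

With the real data the same arithmetic gives 2603 - 40 + 1 = 2564 windows per asset and 2564 × 3 = 7692 samples, matching the shapes printed above. Because the flattened rows stay in time order, slicing them by index still yields a chronological split.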

In [7]:
recorder = PredictionRecorder()
animator = TrainMonitor(figsize=(12,6))

In [None]:
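class SimplePatch(nn.Module):
    """
    Simple Patch for RNN
    Splits a time series into patches for an RNN. An RNN already processes the steps in order,
    so no positional encoding such as RoPE needs to be added.
    """
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def forward(self, x):
        """
        The second-to-last dimension holds the time steps to be patched.
        """
        # Save the leading shape and flatten the leading dimensions
        front_size = tuple(x.shape[:-2])
        seq_len = x.shape[-2]
        feature_size = x.shape[-1]
        if seq_len < self.patch_size:
            raise ValueError(f"sequence length ({seq_len}) must be at least patch_size ({self.patch_size})")
        x_rebatch = x.reshape(-1, seq_len, feature_size)

        # Drop the earliest steps that do not fill a whole patch
        max_patch = seq_len // self.patch_size
        valid_seq_len = self.patch_size * max_patch
        x_valid = x_rebatch[:, -valid_seq_len:, :].clone()  # slicing can leave the tensor non-contiguous, so take a copy
        x_recover = x_valid.reshape(*front_size, max_patch, self.patch_size, feature_size)

        return x_recover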
    
if __name__ == '__main__':
    x = torch.randn(size = (5,7,20,9))
    sp = SimplePatch(6)
    print(sp(x).shape)

torch.Size([5, 7, 3, 6, 9])


In [None]:
class TimeSeriesPatcher(nn.Module):
    """
    Reshape a tensor of shape (*, seq_len, feature) into (*, num_patch, patch_size, feature).
    """

    def __init__(self, patch_size: int, stride: int):
        super().__init__()
        if not isinstance(patch_size, int) or patch_size <= 0:
            raise ValueError("patch_size must be a positive integer.")
        if not isinstance(stride, int) or stride <= 0:
            raise ValueError("stride must be a positive integer.")

        self.patch_size = patch_size
        self.stride = stride

    def forward(self, x):
        """
        num_patch = floor((seq_len - patch_size) / stride) + 1
        """
        seq_len = x.shape[-2]
        assert seq_len >= self.patch_size, f'sequence length ({seq_len}) must be greater than or equal to patch_size ({self.patch_size}).'
        # unfold puts the window in the last dimension: (*, num_patch, feature, patch_size)
        patches = x.unfold(dimension=-2, size=self.patch_size, step=self.stride)
        # Swap the last two dimensions to get (*, num_patch, patch_size, feature)
        patches = patches.swapaxes(-1, -2)
        return patches


Original tensor shape: torch.Size([4, 100, 10])

Patch size: 16, stride: 8
Patched tensor shape: torch.Size([4, 11, 16, 10])
Expected number of patches: floor((100 - 16) / 8) + 1 = 11
Output shape matches expectation? ✅

--- Test with multiple leading dimensions ---
Original tensor shape: torch.Size([2, 3, 100, 10])

Patched tensor shape: torch.Size([2, 3, 11, 16, 10])
Expected shape: (2, 3, 11, 16, 10)
Output shape matches expectation? ✅

--- Test error handling ---
Caught the expected error: sequence length (10) must be greater than or equal to patch_size (16). ✅


In [None]:
import math  # needed for math.log below

class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEmbedding, self).__init__()
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False

        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return self.pe[:, :x.size(1)]
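Note that `forward` returns only the slice of the encoding table that matches the input length; adding it to the embeddings is left to the caller. The sketch below is a functional restatement of the same sinusoidal formula, with a hypothetical function name and made-up sizes, showing how the table might be added to a batch of patch embeddings. It assumes an even `d_model`.

```python
import math
import torch


def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings of shape (max_len, d_model); d_model is assumed even."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * -(math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even columns
    pe[:, 1::2] = torch.cos(position * div_term)   # odd columns
    return pe


num_patch, d_model = 4, 8
pe = sinusoidal_table(max_len=5000, d_model=d_model)

# Hypothetical patch embeddings: (batch, num_patch, d_model)
patch_embeddings = torch.randn(2, num_patch, d_model)
with_position = patch_embeddings + pe[:num_patch].unsqueeze(0)

print(with_position.shape)   # torch.Size([2, 4, 8])
print(pe[0])                 # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```

Every entry of the table lies in [-1, 1], so the scale of the embeddings it is added to decides how strongly position competes with content.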

In [None]:
class Patch_LSTM(nn.Module):
    """Two-level recurrent model: an RNN encodes each patch, then an LSTM runs across the patch encodings."""
    def __init__(self, input_size, patch_size, in_patch_hidden_size, inpatch_num_layers, hidden_size, num_layers, dropout):
        super().__init__()
        self.device = 'cuda:0'
        self.input_size = input_size
        self.patch_size = patch_size
        self.in_patch_hidden_size = in_patch_hidden_size
        self.hidden_size = hidden_size

        self.simple_patch = SimplePatch(patch_size)

        # Inner RNN: reads the time steps inside one patch
        self.inpatch_process = nn.RNN(
            input_size = input_size,
            hidden_size = in_patch_hidden_size,
            num_layers = inpatch_num_layers,
            dropout = dropout if inpatch_num_layers > 1 else 0.0,  # RNN dropout only acts between stacked layers
            batch_first = True,
            # nonlinearity='relu',
        )
        self.dropout = nn.Dropout(dropout)

        # Outer LSTM: reads the sequence of patch encodings
        self.process = nn.LSTM(
            input_size = in_patch_hidden_size,  # the inner RNN's hidden state is the input of the outer LSTM
            hidden_size = hidden_size,
            num_layers = num_layers,
            dropout = dropout if num_layers > 1 else 0.0,
            batch_first = True,
        )

        self.output = nn.Sequential(
            nn.Dropout(dropout),
            HybridDecoder(dim_state = hidden_size, init_prob = [0.0,0.5,0.0])
        )

    def forward(self, x):
        # SimplePatch already drops the leading steps that do not fill a whole patch
        x_patched = self.simple_patch(x)                       # (*, num_patch, patch_size, input_size)
        num_patch = x_patched.shape[-3]
        front_size = tuple(x_patched.shape[:-2])               # (*, num_patch)
        # Treat every patch as its own sequence and keep the inner RNN's last output
        x_rebatched = x_patched.reshape(-1, self.patch_size, self.input_size)
        x_processed_1 = self.inpatch_process(x_rebatched)[0][:,-1,:]
        x_recover = x_processed_1.reshape(*front_size, -1)     # (*, num_patch, in_patch_hidden_size)
        front_size_2 = tuple(x_recover.shape[:-2])
        # Run the outer LSTM over the patch encodings and keep its last output
        x_rebatched_2 = x_recover.reshape(-1, num_patch,  self.in_patch_hidden_size)
        x_rebatched_2 = self.dropout(x_rebatched_2)
        x_processed_2 = self.process(x_rebatched_2)[0][:,-1,:]
        x_recover_2 = x_processed_2.reshape(*front_size_2, self.hidden_size)

        return self.output(x_recover_2)

if __name__ == '__main__':
    x = torch.randn(size = (7,40,9))
    model = Patch_LSTM(input_size = 9, patch_size = 6, in_patch_hidden_size = 12, inpatch_num_layers = 2, hidden_size = 11, num_layers = 3, dropout = 0.5)
    print(model(x).shape)
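The model above is recurrent at both levels. For comparison, here is a rough sketch of what a patch-level Transformer encoder could look like, following the patching idea described earlier. Everything in it is an assumption made for illustration: the class name, the flatten-and-project patch embedding, the learnable position table, the mean pooling and all sizes are hypothetical and are not taken from the notebook or from `utils`.

```python
import torch
import torch.nn as nn


class PatchTransformerSketch(nn.Module):
    """Hypothetical patch-level Transformer encoder (illustration only)."""

    def __init__(self, input_size, patch_size, d_model=16, nhead=2,
                 num_layers=1, max_patches=64, dropout=0.1):
        super().__init__()
        self.patch_size = patch_size
        # Each patch is flattened and projected to one token of width d_model.
        self.embed = nn.Linear(patch_size * input_size, d_model)
        # Learnable position table, one row per patch position.
        self.pos = nn.Parameter(torch.zeros(1, max_patches, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        batch, seq_len, input_size = x.shape
        num_patch = seq_len // self.patch_size
        # Keep the most recent steps that fill whole patches.
        x = x[:, -num_patch * self.patch_size:, :]
        tokens = x.reshape(batch, num_patch, self.patch_size * input_size)
        h = self.embed(tokens) + self.pos[:, :num_patch]
        h = self.encoder(h)          # attention across patches, not across time steps
        return h.mean(dim=1)         # (batch, d_model) summary of the window


model = PatchTransformerSketch(input_size=10, patch_size=10).eval()
x = torch.randn(7, 40, 10)
with torch.no_grad():
    out = model(x)
print(out.shape)   # torch.Size([7, 16])
```

With `seq_len = 40` and `patch_size = 10` the encoder attends over 4 tokens instead of 40. Given the small dataset, the caution about parameter count raised earlier applies to any such variant.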

In [None]:
result = np.zeros(shape = (10, len(assets_list), 4))

for i in range(10):
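    j = 0  # assets were merged into one sample pool above, so only row 0 of `result` is filled
    train_size = int(0.5*len(feature))
    validation_size = int(0.1*len(feature))
    test_size  = int(0.1*len(feature))

    # Random split point: at least half of the samples go to training, leaving room for validation and test
    split = np.random.randint(train_size, len(feature) - validation_size - test_size)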

    train_set = TensorDataset(feature[:split], label[:split])
    balance_sampler = BalancedSampler(label[:split], 128)
    train_loader = DataLoader(train_set, batch_sampler=balance_sampler)

    validation_set = TensorDataset(feature[split:split+validation_size], label[split:split+validation_size])
    validation_loader = DataLoader(validation_set, batch_size=100)

    test_set = TensorDataset(feature[split+validation_size:split+validation_size+test_size], label[split+validation_size:split+validation_size+test_size])
    test_loader = DataLoader(test_set, batch_size=100)

    
    animator.reset()
    loss_fn = HybridLoss(alpha = 5e-2, delta = 1.3, show_loss = False)
    model = Patch_LSTM(input_size = 10, patch_size = 10, in_patch_hidden_size = 7, inpatch_num_layers = 1, hidden_size = 7, num_layers = 1, dropout = 0.4).to('cuda:0')
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay = 1e-1)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)
    train = ModelTrain(model = model,
                   train_loader = train_loader,
                   validation_loader = validation_loader,
                   test_loader = test_loader,
                   loss_fn = loss_fn,
                   optimizer = optimizer,
                   scheduler = scheduler,
                   recorder = recorder,
                   graph = animator,
                   )
    prediction, precision = train.epoch_train(epochs = 30, early_stop = 100)

    result[i,j,0] = prediction
    result[i,j,1] = precision

In [None]:
all_assets = pd.DataFrame({
    'stage_1_prediction': np.mean(result, axis = 0)[:,0],
    'stage_2_prediction': np.mean(result, axis = 0)[:,2],

    'stage_1_precision': np.mean(result, axis = 0)[:,1],
    'stage_2_precision': np.mean(result, axis = 0)[:,3],

    'stage_1_precision_std': np.std(result, axis = 0)[:,1],
    'stage_2_precision_std': np.std(result, axis = 0)[:,3],
})
all_assets.index = pd.Series(assets_list)
for col in all_assets.columns:
    all_assets[col] = all_assets[col].apply(lambda x: f"{x:.1%}")

# Convert to a Markdown table
markdown_table = all_assets.to_markdown(index=False)
print(f'hidden_size: {hidden_size}, num_layers: {num_layers}, seq_len: {seq_len}')
print(markdown_table)