# Representation Model Tutorial

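The representation model (TS2Vec) is a self-supervised model that aims to learn a general-purpose feature representation that downstream tasks can reuse. Mainstream self-supervised learning methods today are either generative or contrastive; the TS2Vec model used in this example is a contrastive self-supervised model.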
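To make the contrastive idea concrete, here is a minimal NumPy sketch of a generic InfoNCE-style loss. It is not TS2Vec's actual training objective; the function name `info_nce_loss`, the temperature value, and the random data are all assumptions made for illustration. The loss is low when two views of the same sample map to similar embeddings and high when they do not.

```python
import numpy as np

def info_nce_loss(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss for two batches of embeddings, each of shape (batch, dim).

    Row i of view_a and row i of view_b are treated as a positive pair;
    every other row in view_b acts as a negative for row i of view_a.
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 16))
aligned = anchor + 0.01 * rng.normal(size=(8, 16))  # a slightly perturbed "second view"
unrelated = rng.normal(size=(8, 16))                # embeddings with no relation to anchor

loss_aligned = info_nce_loss(anchor, aligned)
loss_unrelated = info_nce_loss(anchor, unrelated)
print(loss_aligned, loss_unrelated)
```

The aligned pair should yield a much smaller loss than the unrelated pair, which is the signal a contrastive encoder is trained to exploit.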

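A self-supervised model is generally used in two stages:
1. Pre-train on unlabeled data, independently of any downstream task
2. Fine-tune on the downstream task with labeled data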

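Using TS2Vec with a downstream task follows the same self-supervised paradigm and also has two stages:
1. Train the representation model
2. Feed the output of the representation model to the downstream task (forecasting, in this example)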
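The two stages can be sketched without any deep learning library. In the sketch below, PCA (computed with an SVD) stands in for the pre-trained encoder and ordinary least squares stands in for the downstream model; both are substitutions chosen for brevity, not what TS2Vec or PaddleTS does, and all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: learn a representation from unlabeled data only.
# PCA is used here as a stand-in for a self-supervised encoder.
unlabeled = rng.normal(size=(500, 10))
mean = unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(unlabeled - mean, full_matrices=False)
components = vt[:4]  # keep a 4-dimensional representation

def encode(x):
    """Frozen encoder: project inputs onto the components learned in stage 1."""
    return (x - mean) @ components.T

# Stage 2: fit a downstream model on labeled data, using the frozen encoder's output.
labeled_x = rng.normal(size=(100, 10))
labeled_y = labeled_x @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

features = encode(labeled_x)
design = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
weights, *_ = np.linalg.lstsq(design, labeled_y, rcond=None)
predictions = design @ weights
print(features.shape, predictions.shape)
```

The encoder never sees `labeled_y`, and the downstream model never sees the raw inputs, which mirrors the separation between the two stages.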

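To serve both beginners and more experienced developers, this tutorial presents two ways to use the representation model:
1. A pipeline that combines the representation model with the downstream task, which is very beginner-friendly
2. The representation model decoupled from the downstream task, showing in detail how the two fit together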

# Approach 1: a pipeline combining the representation model and the downstream task

# Prepare the dataset

In [1]:
import numpy as np
np.random.seed(2022)
import pandas as pd
import matplotlib.pyplot as plt

import paddle
paddle.seed(2022)

from paddlets.models.representation.dl.ts2vec import TS2Vec
from paddlets.datasets.repository import get_dataset
from paddlets.models.representation.task.repr_forecasting import ReprForecasting

data = get_dataset('ETTh1')
data, _ = data.split('2016-09-22 06:00:00')
train_data, test_data = data.split('2016-09-21 05:00:00')
train_data


                            OT    HUFL   HULL    MUFL   MULL   LUFL   LULL
date                                                                      
2016-07-01 00:00:00  30.531000   5.827  2.009   1.599  0.462  4.203  1.340
2016-07-01 01:00:00  27.787001   5.693  2.076   1.492  0.426  4.142  1.371
2016-07-01 02:00:00  27.787001   5.157  1.741   1.279  0.355  3.777  1.218
2016-07-01 03:00:00  25.044001   5.090  1.942   1.279  0.391  3.807  1.279
2016-07-01 04:00:00  21.948000   5.358  1.942   1.492  0.462  3.868  1.279
...                        ...     ...    ...     ...    ...    ...    ...
2016-09-21 01:00:00  21.878000  13.396  4.354  11.940  3.198  1.310  0.670
2016-09-21 02:00:00  22.230000  12.458  4.354  11.407  2.878  1.127  0.579
2016-09-21 03:00:00  22.230000  12.927  4.086  11.655  2.878  1.127  0.579
2016-09-21 04:00:00  22.722000  12.324  6.162  11.407  4.655  1.340  0.609
2016-09-21 05:00:00  22.511000  14.133  6.497  12.650  5.082  1.614  0.670

[1974 rows x 7 columns]

In [10]:
train_data.get_observed_cov()

                       HUFL   HULL   LUFL   LULL    MUFL   MULL
date                                                           
2016-07-01 00:00:00   5.827  2.009  4.203  1.340   1.599  0.462
2016-07-01 01:00:00   5.693  2.076  4.142  1.371   1.492  0.426
2016-07-01 02:00:00   5.157  1.741  3.777  1.218   1.279  0.355
2016-07-01 03:00:00   5.090  1.942  3.807  1.279   1.279  0.391
2016-07-01 04:00:00   5.358  1.942  3.868  1.279   1.492  0.462
...                     ...    ...    ...    ...     ...    ...
2016-09-21 01:00:00  13.396  4.354  1.310  0.670  11.940  3.198
2016-09-21 02:00:00  12.458  4.354  1.127  0.579  11.407  2.878
2016-09-21 03:00:00  12.927  4.086  1.127  0.579  11.655  2.878
2016-09-21 04:00:00  12.324  6.162  1.340  0.609  11.407  4.655
2016-09-21 05:00:00  14.133  6.497  1.614  0.670  12.650  5.082

[1974 rows x 6 columns]

# Model training

In [2]:
ts2vec_params = {
    "segment_size": 200,
    "repr_dims": 320,
    "batch_size": 32,
    "sampling_stride": 200,
    "max_epochs": 20,
}
model = ReprForecasting(
    in_chunk_len=200,
    out_chunk_len=24,
    sampling_stride=1,
    repr_model=TS2Vec,
    repr_model_params=ts2vec_params,
)
model.fit(train_data)

[2022-11-02 18:49:52,748] [paddlets.models.representation.task.repr_forecasting] [INFO] Repr model fit start
W1102 18:49:52.935454 88921 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2
W1102 18:49:52.938324 88921 gpu_context.cc:306] device: 0, cuDNN Version: 8.5.
[2022-11-02 18:49:59,383] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 000| loss: 1414.767578| 0:00:01s
[2022-11-02 18:49:59,575] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 001| loss: 803.826904| 0:00:01s
[2022-11-02 18:49:59,738] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 002| loss: 471.702454| 0:00:01s
[2022-11-02 18:49:59,895] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 003| loss: 260.660858| 0:00:01s
[2022-11-02 18:50:00,052] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 004| loss: 189.227371| 0:00:02s
[2022-11-02 18:50:00,223] [paddlets
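The `in_chunk_len`, `out_chunk_len`, and `sampling_stride` arguments passed to `ReprForecasting` describe a sliding window over the series, assuming they mean the lookback length, the forecast horizon, and the step between consecutive samples. The sketch below is a generic windowing routine in NumPy, not PaddleTS's internal sampler; `make_windows` is a hypothetical helper, and the series is a dummy of the same length as the training set.

```python
import numpy as np

def make_windows(series, in_chunk_len, out_chunk_len, sampling_stride=1):
    """Cut a 1-D series into (input window, target window) pairs."""
    inputs, targets = [], []
    last_start = len(series) - in_chunk_len - out_chunk_len
    for start in range(0, last_start + 1, sampling_stride):
        split = start + in_chunk_len
        inputs.append(series[start:split])                   # lookback window
        targets.append(series[split:split + out_chunk_len])  # values to forecast
    return np.array(inputs), np.array(targets)

series = np.arange(1974, dtype=float)  # dummy series, same length as train_data
x, y = make_windows(series, in_chunk_len=200, out_chunk_len=24, sampling_stride=1)
print(x.shape, y.shape)
```

Under these assumptions, a 1974-point series yields 1974 - 200 - 24 + 1 = 1751 samples with a stride of 1. A larger stride reduces the number of samples proportionally.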

# Prediction

In [4]:
model.predict(train_data)

                            OT
2016-09-21 06:00:00  22.991394
2016-09-21 07:00:00  23.517416
2016-09-21 08:00:00  24.104179
2016-09-21 09:00:00  24.243126
2016-09-21 10:00:00  24.298140
2016-09-21 11:00:00  24.367647
2016-09-21 12:00:00  24.640148
2016-09-21 13:00:00  24.640419
2016-09-21 14:00:00  24.576252
2016-09-21 15:00:00  24.295414
2016-09-21 16:00:00  23.657946
2016-09-21 17:00:00  23.767244
2016-09-21 18:00:00  23.669827
2016-09-21 19:00:00  23.264309
2016-09-21 20:00:00  22.973028
2016-09-21 21:00:00  22.930428
2016-09-21 22:00:00  22.845171
2016-09-21 23:00:00  22.806917
2016-09-22 00:00:00  22.769144
2016-09-22 01:00:00  23.296446
2016-09-22 02:00:00  23.689632
2016-09-22 03:00:00  24.013086
2016-09-22 04:00:00  23.938864
2016-09-22 05:00:00  23.524876

# The built-in `backtest` API can be used for prediction and evaluation

In [4]:
from paddlets.utils.backtest import backtest
score, predicts = backtest(
    data,
    model,
    start="2016-09-21 06:00:00",
    predict_window=24,
    stride=24,
    return_predicts=True,
)


Backtest Progress: 100%|██████████| 2/2 [00:00<00:00,  3.68it/s]
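Conceptually, a backtest repeatedly forecasts a window, compares it with the held-out values, and then moves the forecast origin forward. The sketch below illustrates that loop in plain NumPy with a naive last-value forecaster. It is not the PaddleTS implementation: `rolling_backtest` and `last_value_forecast` are hypothetical helpers, integer positions replace timestamps, and mean squared error is simply the metric chosen for this sketch.

```python
import numpy as np

def rolling_backtest(series, start, predict_window, stride, forecast_fn):
    """Forecast `predict_window` steps from each origin and average the errors."""
    errors, forecasts = [], []
    for origin in range(start, len(series) - predict_window + 1, stride):
        history = series[:origin]                         # data visible to the model
        actual = series[origin:origin + predict_window]   # held-out ground truth
        predicted = forecast_fn(history, predict_window)
        forecasts.append(predicted)
        errors.append(np.mean((predicted - actual) ** 2))
    return float(np.mean(errors)), forecasts

def last_value_forecast(history, horizon):
    """Naive baseline: repeat the last observed value."""
    return np.full(horizon, history[-1])

series = np.sin(np.arange(120) / 6.0)
score, forecasts = rolling_backtest(
    series, start=72, predict_window=24, stride=24, forecast_fn=last_value_forecast
)
print(score, len(forecasts))
```

With `stride` equal to `predict_window`, the forecast windows tile the evaluation range without overlapping.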


# Approach 2: decoupling the representation model from the downstream regression task

# Stage 1
1. Train the representation model

2. Output the representations of the training and test sets

# Prepare the dataset

In [5]:
import numpy as np
import pandas as pd

from paddlets.models.representation.dl.ts2vec import TS2Vec
from paddlets.datasets.repository import get_dataset

data = get_dataset('ETTh1')
data, _ = data.split('2016-09-22 06:00:00')
train_data, test_data = data.split('2016-09-21 05:00:00')
train_data

                            OT    HUFL   HULL    MUFL   MULL   LUFL   LULL
date                                                                      
2016-07-01 00:00:00  30.531000   5.827  2.009   1.599  0.462  4.203  1.340
2016-07-01 01:00:00  27.787001   5.693  2.076   1.492  0.426  4.142  1.371
2016-07-01 02:00:00  27.787001   5.157  1.741   1.279  0.355  3.777  1.218
2016-07-01 03:00:00  25.044001   5.090  1.942   1.279  0.391  3.807  1.279
2016-07-01 04:00:00  21.948000   5.358  1.942   1.492  0.462  3.868  1.279
...                        ...     ...    ...     ...    ...    ...    ...
2016-09-21 01:00:00  21.878000  13.396  4.354  11.940  3.198  1.310  0.670
2016-09-21 02:00:00  22.230000  12.458  4.354  11.407  2.878  1.127  0.579
2016-09-21 03:00:00  22.230000  12.927  4.086  11.655  2.878  1.127  0.579
2016-09-21 04:00:00  22.722000  12.324  6.162  11.407  4.655  1.340  0.609
2016-09-21 05:00:00  22.511000  14.133  6.497  12.650  5.082  1.614  0.670

[1974 rows x 7 columns]

# Train the representation model

In [6]:
# Instantiate the TS2Vec model
ts2vec = TS2Vec(
    segment_size=200,  # maximum sequence length
    repr_dims=320,  # dimensionality of the output representation
    batch_size=32,
    max_epochs=20,
)
# Train
ts2vec.fit(train_data)

[2022-11-02 11:50:20,539] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 000| loss: 78.309945| 0:00:10s
[2022-11-02 11:50:30,399] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 001| loss: 4.548435| 0:00:19s
[2022-11-02 11:50:40,499] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 002| loss: 3.869072| 0:00:30s
[2022-11-02 11:50:50,391] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 003| loss: 3.646648| 0:00:39s
[2022-11-02 11:51:00,268] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 004| loss: 3.551296| 0:00:49s
[2022-11-02 11:51:10,484] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 005| loss: 3.591498| 0:01:00s
[2022-11-02 11:51:20,490] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 006| loss: 3.517901| 0:01:10s
[2022-11-02 11:51:30,249] [paddlets.models.common.callbacks.callbacks] [INFO] epoch 007| loss: 3.508513| 0:01:19s
[2022-11-02 11:51:39,716] [paddlets.models.c

# Output the representations of the training and test sets

In [7]:
sliding_len = 200 
all_reprs = ts2vec.encode(data, sliding_len=sliding_len) 
split_tag = len(train_data['OT'])
train_reprs = all_reprs[:, :split_tag]
test_reprs = all_reprs[:, split_tag:]

100%|██████████| 1999/1999 [00:02<00:00, 958.44it/s]


# Stage 2
1. Build the training and test samples for the regression model

2. Train and predict

# Build the training and test samples for the regression model

In [8]:
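def generate_pred_samples(features, data, pred_len, drop=0):
    # features: (n_series, n_timestamps, repr_dims); data: (n_series, n_timestamps, n_columns)
    n = data.shape[1]
    # Drop the last pred_len representations, which have no complete future window.
    features = features[:, :-pred_len]
    # Pair the representation at time t with the raw rows at t+1 ... t+pred_len.
    labels = np.stack([data[:, i:1+n+i-pred_len] for i in range(pred_len)], axis=2)[:, 1:]
    # Skip the first `drop` time steps.
    features = features[:, drop:]
    labels = labels[:, drop:]
    return features.reshape(-1, features.shape[-1]), \
            labels.reshape(-1, labels.shape[2]*labels.shape[3])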

pre_len = 24  # number of future time steps to predict

# Build the training samples
train_to_numpy = train_data.to_numpy()
train_to_numpy = np.expand_dims(train_to_numpy, 0)  # add a leading axis to match the shape of the encode output
train_features, train_labels = generate_pred_samples(train_reprs, train_to_numpy, pre_len, drop=sliding_len)

# Build the test samples
test_to_numpy = test_data.to_numpy()
test_to_numpy = np.expand_dims(test_to_numpy, 0)  # same as above
test_features, test_labels = generate_pred_samples(test_reprs, test_to_numpy, pre_len)  # build the samples
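The array shapes produced by `generate_pred_samples` are easy to lose track of, so the following self-contained check runs the same function on dummy arrays. The sizes (25 time steps, 7 columns, 320 representation dimensions) are chosen to mirror a small test split and should be treated as illustrative; the zero-filled `reprs` array is a placeholder for real encoder output.

```python
import numpy as np

def generate_pred_samples(features, data, pred_len, drop=0):
    n = data.shape[1]
    features = features[:, :-pred_len]
    labels = np.stack(
        [data[:, i:1 + n + i - pred_len] for i in range(pred_len)], axis=2
    )[:, 1:]
    features = features[:, drop:]
    labels = labels[:, drop:]
    return (features.reshape(-1, features.shape[-1]),
            labels.reshape(-1, labels.shape[2] * labels.shape[3]))

# Dummy inputs: 1 series, 25 time steps, 7 columns, 320-dim representations.
values = np.arange(25 * 7, dtype=float).reshape(1, 25, 7)
reprs = np.zeros((1, 25, 320))  # placeholder for encoder output

features, labels = generate_pred_samples(reprs, values, pred_len=24)
print(features.shape, labels.shape)

# The first 7 label entries are the raw row at time step 1, i.e. one step
# after the time step whose representation is used as the feature.
print(labels[0, :7])
```

With 25 time steps and a 24-step horizon only one sample remains, and its label is the 24 x 7 block of future rows flattened into 168 values in time-major order.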

# Training and prediction

In [9]:
# Train
from sklearn.linear_model import Ridge
lr = Ridge(alpha=0.1)
lr.fit(train_features, train_labels)

# Predict
test_pred = lr.predict(test_features)
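Because each label was flattened from a (time step, column) block, each row of `test_pred` holds `pre_len * n_columns` values. A minimal sketch of unflattening such a row is shown below. It uses a dummy array in place of `test_pred`, and it assumes the target column comes first in the `to_numpy()` output, which should be verified against the actual dataset before relying on it.

```python
import numpy as np

pred_len, n_columns = 24, 7

# Stand-in for test_pred: one sample with pred_len * n_columns flattened values.
flat_pred = np.arange(pred_len * n_columns, dtype=float).reshape(1, -1)

# Restore the (sample, time step, column) layout used when the labels were built.
per_step = flat_pred.reshape(-1, pred_len, n_columns)

# Assumption: column 0 is the target; adjust the index if the column order differs.
target_forecast = per_step[0, :, 0]
print(per_step.shape, target_forecast.shape)
```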

In [10]:
test_pred

array([[23.411926  , 14.156788  ,  5.5864105 ,  2.0591881 ,  0.8507492 ,
        12.033666  ,  3.8670216 , 24.307272  , 12.970997  ,  5.047309  ,
         1.5718118 ,  0.83765304, 11.449925  ,  3.6241622 , 25.072916  ,
        11.619985  ,  4.647399  ,  1.2839303 ,  0.8075943 , 10.430354  ,
         3.267183  , 25.813192  , 10.340514  ,  4.2520375 ,  1.0931191 ,
         0.77652   ,  9.312     ,  3.038098  , 25.522808  , 10.041627  ,
         4.0208244 ,  1.4567091 ,  0.779971  ,  8.682292  ,  2.5960407 ,
        25.318005  ,  9.248054  ,  4.045926  ,  1.7158421 ,  0.82549214,
         7.7777796 ,  2.616672  , 25.033335  ,  8.437286  ,  3.899671  ,
         1.6119858 ,  0.821985  ,  7.0049305 ,  2.5203104 , 24.626888  ,
         8.839753  ,  4.105016  ,  1.7098855 ,  0.82516253,  7.3333654 ,
         2.8516943 , 24.77245   ,  9.711424  ,  4.28503   ,  2.0725615 ,
         0.9125666 ,  7.6130066 ,  2.8810837 , 24.521465  , 10.09363   ,
         4.5307975 ,  2.1585398 ,  0.936623  ,  7.9