LP-Mirror is a data selection method designed for time series forecasting tasks. It filters training batches to identify and retain high-quality samples. The code is compatible with mainstream time series datasets and can be used directly within existing forecasting frameworks.
You can customize and run the shell scripts to execute the selection process. Please refer to run_selection.py or run_cross_selection.py for implementation details.
⚠️ Note: Since the effective dataset size decreases after selection, hyperparameter tuning (especially the learning rate or batch size) is necessary to prevent overfitting or underfitting.
This guide explains how to modify the source code of mainstream time series forecasting frameworks (e.g., Time-Series-Library, Autoformer, Informer) to support loading high-quality data indices (.npy files) generated by LP-Mirror.
By following these steps, you can train your model using only the "effective samples" while keeping the original data preprocessing logic (e.g., standardization) intact.
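The key invariant here is ordering: normalization statistics come from the full training split, while only a subset of windows feeds gradient updates. A minimal numpy sketch of that ordering (the array and indices below are made up for illustration):

```python
import numpy as np

# Toy "training split" and a hypothetical set of selected window indices
full_data = np.arange(10, dtype=np.float64)   # stands in for the raw series
selected_indices = np.array([0, 2, 3, 7])     # stands in for LP-Mirror's output

# 1. Fit normalization statistics on the FULL data (as the framework already does)
mean, std = full_data.mean(), full_data.std()
normalized = (full_data - mean) / std

# 2. Only then restrict training to the selected samples
subset = normalized[selected_indices]

# The subset is normalized with full-data statistics, so values match exactly
assert np.allclose(subset, (full_data[selected_indices] - mean) / std)
print(len(subset))  # 4
```

Reversing the order (fitting the scaler on the subset) would silently change the normalization statistics, which is exactly what the modification below avoids.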
Add a new command-line argument to the main run script to receive the path of the selection result file.
File: ./run_longExp.py
Locate the argument definition section (usually under if __name__ == '__main__':) and add the --data_selection_path argument:
import argparse
# ... other imports ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Autoformer & Transformer family for Time Series Forecasting')
    # ... existing arguments (e.g., --root_path, --data_path) ...
    parser.add_argument('--patience', type=int, default=3, help='early stopping patience')
    parser.add_argument('--learning_rate', type=float, default=0.0001, help='optimizer learning rate')

    # ================= [Added Code] =================
    # Path to the .npy file containing selected indices. Default is None (use full data).
    parser.add_argument('--data_selection_path', type=str, default=None,
                        help='Path to the .npy file containing selected indices')
    # ================================================

    args = parser.parse_args()
    # ... following code ...

This is the core modification. We need to intercept the training dataset after it is instantiated and prune it with the Subset class based on the selected indices.
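Before touching the data factory, the new flag can be sanity-checked with a minimal standalone parser (a hypothetical stand-in mirroring only the relevant argument from the snippet above):

```python
import argparse

# Minimal stand-in parser containing just the new argument
parser = argparse.ArgumentParser()
parser.add_argument('--data_selection_path', type=str, default=None,
                    help='Path to the .npy file containing selected indices')

# Without the flag, the default of None means "use the full dataset"
args_full = parser.parse_args([])
print(args_full.data_selection_path)  # None

# With the flag, the path is forwarded to data_provider unchanged
args_sel = parser.parse_args(
    ['--data_selection_path', './results/selection_ETTh1/selected_indices.npy'])
print(args_sel.data_selection_path)  # ./results/selection_ETTh1/selected_indices.npy
```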
File: ./data_provider/data_factory.py
Add the following imports at the top of the file:
from data_provider.data_loader import Dataset_ETT_hour, Dataset_ETT_minute, Dataset_Custom, Dataset_Pred
from torch.utils.data import DataLoader

# ================= [Added Imports] =================
from torch.utils.data import Subset  # For creating data subsets
import numpy as np                   # For loading .npy files
import os                            # For checking path existence
# ===================================================

In the data_provider function, add the filtering logic after the Dataset is instantiated:
def data_provider(args, flag):
    Data = data_dict[args.data]
    # ... (timeenc setup omitted) ...
    # ... (shuffle_flag, batch_size setup omitted) ...

    # 1. Instantiate the dataset normally.
    # Note: this step is crucial! The full dataset must be instantiated first,
    # so that the StandardScaler is fit on the FULL data for consistency.
    data_set = Data(
        root_path=args.root_path,
        data_path=args.data_path,
        flag=flag,
        size=[args.seq_len, args.label_len, args.pred_len],
        features=args.features,
        target=args.target,
        timeenc=timeenc,
        freq=freq
    )
    # ================= [Added Logic: Data Selection] =================
    # Execute only in 'train' mode and when a selection path is provided
    if flag == 'train' and getattr(args, 'data_selection_path', None) is not None:
        if os.path.exists(args.data_selection_path):
            print(f"\n>>>>>> [LP-Mirror] Loading selected data indices from: {args.data_selection_path}")
            try:
                # 1. Load the selected indices
                selected_indices = np.load(args.data_selection_path)
                # 2. Record the original size
                original_len = len(data_set)
                # 3. Apply Subset to prune the data
                data_set = Subset(data_set, selected_indices)
                print(">>>>>> [LP-Mirror] Selection Applied.")
                print(f"       Original Size: {original_len} -> Selected Size: {len(data_set)}")
                print(f"       Retention Ratio: {len(data_set)/original_len*100:.2f}%\n")
            except Exception as e:
                print(f">>>>>> [Error] Failed to load indices: {e}. Training on FULL dataset.")
        else:
            print(f"\n>>>>>> [Warning] Path not found: {args.data_selection_path}. Training on FULL dataset.\n")
    # =================================================================

    # Create the DataLoader normally
    data_loader = DataLoader(
        data_set,
        batch_size=batch_size,
        # ... existing params ...
    )
    return data_set, data_loader

After completing the modifications above, follow these steps to run your experiment:
Run your selection script (e.g., run_selection.py) to generate the index file.
python run_selection.py --data_path ETTh1.csv --save_path ./results/selection_ETTh1/
# Output: ./results/selection_ETTh1/selected_indices.npy
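The filtering logic above assumes the generated file holds a 1-D integer array of dataset indices, so it can be worth inspecting before training. A self-contained check (a dummy file is written here for illustration; in practice, point np.load at your real selected_indices.npy):

```python
import os
import tempfile

import numpy as np

# Stand-in for the file produced by the selection script
path = os.path.join(tempfile.mkdtemp(), 'selected_indices.npy')
np.save(path, np.array([0, 2, 3, 7, 11]))

indices = np.load(path)

# The data_factory logic expects 1-D integer indices into the training dataset
assert indices.ndim == 1
assert np.issubdtype(indices.dtype, np.integer)
print(indices.shape, indices.dtype)
```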
When running run_longExp.py, point to the generated .npy file using the --data_selection_path argument.
Example:
python -u run_longExp.py \
--is_training 1 \
--root_path ./dataset/ \
--data_path ETTh1.csv \
--model_id ETTh1_96_96 \
--model TimesNet \
--data ETTh1 \
--features M \
--seq_len 96 \
--label_len 48 \
--pred_len 96 \
--data_selection_path ./results/selection_ETTh1/selected_indices.npy
If the integration succeeded, you will see output similar to the following in your console:
>>>>>> [LP-Mirror] Loading selected data indices from: ./results/selection_ETTh1/selected_indices.npy
>>>>>> [LP-Mirror] Selection Applied.
Original Size: 8545 -> Selected Size: 5981
Retention Ratio: 70.00%
This indicates that the model is now training exclusively on the "high-quality" data verified by LP-Mirror.
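As a final sanity note, torch.utils.data.Subset does nothing more than remap indices: item i of the pruned dataset is base[indices[i]], and the base dataset (including its fitted scaler) is untouched. A stdlib-only sketch of those semantics, with toy stand-in classes:

```python
# Stdlib-only sketch of torch.utils.data.Subset semantics (no torch required)
class ToyDataset:
    """Stand-in for a windowed time series dataset."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return f"window_{i}"


class ToySubset:
    """Mimics Subset: item i of the subset is base[indices[i]]."""
    def __init__(self, base, indices):
        self.base, self.indices = base, indices

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        return self.base[self.indices[i]]


full = ToyDataset(8545)
pruned = ToySubset(full, [0, 2, 3, 7])

print(len(pruned))  # 4
print(pruned[1])    # window_2  (subset position 1 maps to base index 2)
```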