# LP-Mirror: Data Selection Framework for Time Series Forecasting

LP-Mirror is a data selection method designed for time series forecasting tasks. It filters training batches to identify and retain high-quality samples. The code is compatible with mainstream time series datasets and can be dropped into existing forecasting frameworks.

You can customize and run the provided shell scripts to execute the selection process; refer to `run_selection.py` or `run_cross_selection.py` for implementation details.

⚠️ **Note:** Since the effective dataset size shrinks after selection, hyperparameter tuning (especially the learning rate or batch size) is necessary to prevent overfitting or underfitting.
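As a starting point for that retuning, one common heuristic (an assumption for illustration, not part of LP-Mirror) is to shrink the batch size roughly in proportion to the retention ratio, so the number of optimizer steps per epoch stays comparable after selection:

```python
def adjust_batch_size(original_batch_size: int, original_len: int, selected_len: int) -> int:
    """Hypothetical heuristic: scale the batch size by the retention ratio."""
    retention = selected_len / original_len          # fraction of training data kept
    return max(1, round(original_batch_size * retention))

# Example: keeping 5981 of 8545 samples (~70% retention) with batch size 32
print(adjust_batch_size(32, 8545, 5981))  # -> 22
```

Treat the result only as an initial value for a proper sweep; learning-rate adjustments in particular usually still need to be validated empirically.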


## 🛠️ Integration Guide

This guide explains how to modify the source code of mainstream time series forecasting frameworks (e.g., Time-Series-Library, Autoformer, Informer) to load the high-quality data indices (`.npy` files) generated by LP-Mirror.

By following these steps, you can train your model on only the selected "effective samples" while keeping the original data preprocessing logic (e.g., standardization) intact.

### Step 1: Modify `run_longExp.py`

Add a new command-line argument to the main run script to receive the path of the selection result file.

**File:** `./run_longExp.py`

Locate the argument definition section (usually under `if __name__ == '__main__':`) and add the `--data_selection_path` argument:

```python
import argparse
# ... other imports ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Autoformer & Transformer family for Time Series Forecasting')

    # ... existing arguments (e.g., --root_path, --data_path) ...
    parser.add_argument('--patience', type=int, default=3, help='early stopping patience')
    parser.add_argument('--learning_rate', type=float, default=0.0001, help='optimizer learning rate')

    # ================= [Added Code] =================
    # Path to the .npy file containing selected indices. Default is None (use the full dataset).
    parser.add_argument('--data_selection_path', type=str, default=None,
                        help='Path to the .npy file containing selected indices')
    # ================================================

    args = parser.parse_args()
    # ... following code ...
```
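As a quick sanity check of the new flag, the snippet below builds a standalone parser with only the added argument (a minimal sketch; the real script has many more arguments) and confirms that omitting the flag falls back to `None`, i.e., full-data training:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--data_selection_path', type=str, default=None,
                    help='Path to the .npy file containing selected indices')

assert parser.parse_args([]).data_selection_path is None  # omitted -> use full data
args = parser.parse_args(['--data_selection_path', 'sel.npy'])
print(args.data_selection_path)  # -> sel.npy
```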

### Step 2: Modify `data_factory.py`

This is the core modification. We intercept the training dataset after instantiation and prune it with the `Subset` class based on the loaded indices.

**File:** `./data_provider/data_factory.py`

#### 2.1 Add Imports

Add the following imports at the top of the file:

```python
from data_loader import Dataset_ETT_hour, Dataset_ETT_minute, Dataset_Custom, Dataset_Pred
from torch.utils.data import DataLoader

# ================= [Added Imports] =================
from torch.utils.data import Subset  # for creating data subsets
import numpy as np                   # for loading .npy files
import os                            # for checking path existence
# ===================================================
```

#### 2.2 Modify the Logic in `data_provider`

In the `data_provider` function, add the filtering logic after the dataset is instantiated:

```python
def data_provider(args, flag):
    Data = data_dict[args.data]
    # ... (timeenc setup omitted) ...
    # ... (shuffle_flag, batch_size setup omitted) ...

    # 1. Instantiate the dataset normally.
    # Note: this step is crucial -- always instantiate the FULL dataset first,
    # so the StandardScaler is fit on the complete data for consistency.
    data_set = Data(
        root_path=args.root_path,
        data_path=args.data_path,
        flag=flag,
        size=[args.seq_len, args.label_len, args.pred_len],
        features=args.features,
        target=args.target,
        timeenc=timeenc,
        freq=freq
    )

    # ================= [Added Logic: Data Selection] =================
    # Run only in 'train' mode and only when a selection path is provided.
    if flag == 'train' and getattr(args, 'data_selection_path', None) is not None:
        if os.path.exists(args.data_selection_path):
            print(f"\n>>>>>> [LP-Mirror] Loading selected data indices from: {args.data_selection_path}")
            try:
                # 1. Load the selected indices.
                selected_indices = np.load(args.data_selection_path)

                # 2. Record the original dataset size.
                original_len = len(data_set)

                # 3. Prune the dataset with Subset.
                data_set = Subset(data_set, selected_indices)

                print(">>>>>> [LP-Mirror] Selection Applied.")
                print(f"       Original Size: {original_len} -> Selected Size: {len(data_set)}")
                print(f"       Retention Ratio: {len(data_set) / original_len * 100:.2f}%\n")
            except Exception as e:
                print(f">>>>>> [Error] Failed to load indices: {e}. Training on FULL dataset.")
        else:
            print(f"\n>>>>>> [Warning] Path not found: {args.data_selection_path}. Training on FULL dataset.\n")
    # =================================================================

    # Build the DataLoader as usual.
    data_loader = DataLoader(
        data_set,
        batch_size=batch_size,
        # ... existing params ...
    )
    return data_set, data_loader
```
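To see what the `Subset` wrapper actually does to indexing, here is a torch-free stand-in with the same `__len__`/`__getitem__` contract (a sketch for illustration; in the real code `torch.utils.data.Subset` provides this behavior, and `ToyDataset` is a hypothetical placeholder for a forecasting dataset):

```python
import numpy as np

class ToyDataset:
    """Hypothetical stand-in for a forecasting Dataset: item i is just i * 10."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        return i * 10

class MiniSubset:
    """Minimal re-implementation of torch.utils.data.Subset's index mapping."""
    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = indices
    def __len__(self):
        return len(self.indices)
    def __getitem__(self, i):
        # Remap the subset-local index to the original dataset index.
        return self.dataset[self.indices[i]]

full = ToyDataset(10)
selected = np.array([1, 4, 7])      # as loaded from selected_indices.npy
subset = MiniSubset(full, selected)

print(len(subset))  # -> 3
print(subset[0])    # -> 10 (original sample 1, returned unchanged)
```

Because the wrapper only remaps indices, the underlying samples (and the scaler fit on the full data) are untouched.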

## 🚀 Usage

After completing the modifications above, follow these steps to run an experiment:

### Step 1: Run the Selection Script

Run your selection script (e.g., `run_selection.py`) to generate the index file.

```bash
python run_selection.py --data_path ETTh1.csv --save_path ./results/selection_ETTh1/
# Output: ./results/selection_ETTh1/selected_indices.npy
```
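The index file is assumed here to be a plain one-dimensional integer array saved with `np.save` (the format `data_provider` reads back with `np.load`). A minimal round-trip check, using a temporary directory and hypothetical indices:

```python
import os
import tempfile

import numpy as np

# Assumed format: a 1-D array of integer sample indices into the training split.
indices = np.array([0, 3, 5, 8], dtype=np.int64)

path = os.path.join(tempfile.mkdtemp(), 'selected_indices.npy')
np.save(path, indices)

loaded = np.load(path)   # what data_provider would read back
print(loaded.tolist())   # -> [0, 3, 5, 8]
```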

### Step 2: Run the Main Training Script

When running `run_longExp.py`, point the `--data_selection_path` argument at the generated `.npy` file.

**Example command:**

```bash
python -u run_longExp.py \
  --is_training 1 \
  --root_path ./dataset/ \
  --data_path ETTh1.csv \
  --model_id ETTh1_96_96 \
  --model TimesNet \
  --data ETTh1 \
  --features M \
  --seq_len 96 \
  --label_len 48 \
  --pred_len 96 \
  --data_selection_path ./results/selection_ETTh1/selected_indices.npy
```

### Result Verification

If the integration succeeded, you will see output similar to the following in your console:

```text
>>>>>> [LP-Mirror] Loading selected data indices from: ./results/selection_ETTh1/selected_indices.npy
>>>>>> [LP-Mirror] Selection Applied.
       Original Size: 8545 -> Selected Size: 5981
       Retention Ratio: 69.99%
```

This indicates that the model is now training exclusively on the "high-quality" data selected by LP-Mirror.
