# Data Processing Example Code

## 1. Basic Processing

- Fixed-length sliding window
- Splitting the dataset

In [2]:
import numpy as np
import pandas as pd

### Reading the raw data

In [3]:
data_df = pd.read_csv('./dataset/Krakow-airquality/april-2017.csv')
data_df

Unnamed: 0,UTC time,3_temperature,3_humidity,3_pressure,3_pm1,3_pm25,3_pm10,140_temperature,140_humidity,140_pressure,...,857_pressure,857_pm1,857_pm25,857_pm10,895_temperature,895_humidity,895_pressure,895_pm1,895_pm25,895_pm10
0,2017-04-01T00:00:00,,,,,,,6,92,101906,...,,,,,,,,,,
1,2017-04-01T01:00:00,,,,,,,6,92,101869,...,,,,,,,,,,
2,2017-04-01T02:00:00,,,,,,,5,94,101837,...,,,,,,,,,,
3,2017-04-01T03:00:00,,,,,,,5,92,101834,...,,,,,,,,,,
4,2017-04-01T04:00:00,,,,,,,4,94,101832,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
715,2017-04-30T19:00:00,,,,,,,11,82,102037,...,,,,,11.0,69.0,102118.0,31.0,36.0,56.0
716,2017-04-30T20:00:00,,,,,,,10,88,102029,...,,,,,11.0,75.0,102113.0,35.0,41.0,65.0
717,2017-04-30T21:00:00,,,,,,,9,92,102024,...,,,,,10.0,76.0,102118.0,44.0,53.0,83.0
718,2017-04-30T22:00:00,,,,,,,9,96,102002,...,,,,,10.0,77.0,102094.0,44.0,54.0,84.0


Below is a fairly long time series.

In [4]:
long_seq = data_df[['UTC time', '140_temperature']].dropna()
long_seq

Unnamed: 0,UTC time,140_temperature
0,2017-04-01T00:00:00,6
1,2017-04-01T01:00:00,6
2,2017-04-01T02:00:00,5
3,2017-04-01T03:00:00,5
4,2017-04-01T04:00:00,4
...,...,...
715,2017-04-30T19:00:00,11
716,2017-04-30T20:00:00,10
717,2017-04-30T21:00:00,9
718,2017-04-30T22:00:00,9


Using a sliding window of fixed length (for example, 12), we cut it into many short sequences.

In [18]:
long_seq = long_seq['140_temperature']
window_size = 12
short_seqs = []
for i in range(long_seq.shape[0] - window_size + 1):
    short_seqs.append(long_seq.iloc[i:i+window_size].values.tolist())
short_seqs = np.array(short_seqs)
print(short_seqs.shape)

(709, 12)
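
For longer series, the explicit loop can be replaced by a vectorized view. The sketch below assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available. The synthetic `demo_seq` is only a stand-in for the temperature column and is not part of the original data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Synthetic stand-in for the 720-point temperature series (assumption)
demo_seq = np.arange(720)
window_size = 12

# Loop-based windows, as in the cell above
loop_windows = np.array([demo_seq[i:i + window_size]
                         for i in range(len(demo_seq) - window_size + 1)])

# Vectorized equivalent: a read-only view, no data is copied
view_windows = sliding_window_view(demo_seq, window_size)

print(view_windows.shape)                           # (709, 12)
print(np.array_equal(loop_windows, view_windows))   # True
```

The result is a read-only view of the original array, so call `.copy()` on it before modifying the windows in place.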


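If you need training, validation, and test sets, first split the complete, ordered original long sequence by proportion and then apply the sliding window to each part separately. Do not generate the short sequences with the sliding window first and split them afterwards, because neighbouring windows overlap and the same time steps would then appear in more than one set.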

In [19]:
train_set_proportion, val_set_proportion = 0.6, 0.2
total_len = long_seq.shape[0]
train_val_split = int(total_len * train_set_proportion)
val_test_split = int(total_len * (train_set_proportion + val_set_proportion))
train_seq, val_seq, test_seq = long_seq[:train_val_split],\
                               long_seq[train_val_split:val_test_split],\
                               long_seq[val_test_split:]

train_set = []
# "+ 1" keeps the final window, consistent with the full-sequence loop above
for i in range(train_seq.shape[0] - window_size + 1):
    train_set.append(train_seq.iloc[i:i+window_size].tolist())
train_set = np.array(train_set)
print(train_set.shape)

(421, 12)
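
To illustrate why the order of splitting and windowing matters, the following minimal sketch compares the two orders on a synthetic series. The values of `series` double as time-step ids, so any value shared between two sets is a leaked time step. The helper `make_windows` and the split point are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def make_windows(seq, window_size):
    """Return every contiguous window of length `window_size` from a 1-D array."""
    n_windows = len(seq) - window_size + 1
    return np.array([seq[i:i + window_size] for i in range(n_windows)])

series = np.arange(100)   # each value is also its time-step id
window_size = 12
split = 60                # illustrative split point

# Order A: window first, then split the windows
all_windows = make_windows(series, window_size)
leaky_train, leaky_test = all_windows[:split], all_windows[split:]
leaky_overlap = np.intersect1d(leaky_train, leaky_test)

# Order B: split the series first, then window each part
clean_train = make_windows(series[:split], window_size)
clean_test = make_windows(series[split:], window_size)
clean_overlap = np.intersect1d(clean_train, clean_test)

print(len(leaky_overlap), len(clean_overlap))   # 11 0
```

With order A, the last training windows and the first test windows share 11 time steps; with order B, the two sets are disjoint.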


## 2. Advanced Processing

- Sequence resampling
- Fixed-time-span sliding window
- Padding and packing sequences of unequal length

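### Sequence resampling

One way to deal with an irregular time axis is to fill in the timestamps that are missing from the original long sequence. pandas makes this fairly easy. By default, the missing time points are filled with NaN.

Below is a long sequence whose time axis is not regular: several hours of data are missing between some points.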

In [8]:
data_df['UTC time'] = pd.to_datetime(data_df['UTC time'])
data_df = data_df.set_index('UTC time').sort_index()
long_seq = data_df['140_temperature']
long_seq

UTC time
2017-04-01 00:00:00     6
2017-04-01 01:00:00     6
2017-04-01 02:00:00     5
2017-04-01 03:00:00     5
2017-04-01 04:00:00     4
                       ..
2017-04-30 19:00:00    11
2017-04-30 20:00:00    10
2017-04-30 21:00:00     9
2017-04-30 22:00:00     9
2017-04-30 23:00:00     8
Name: 140_temperature, Length: 720, dtype: int64

Take the earliest and latest timestamps from the index and use `pd.date_range()` to define a complete time index.

In [9]:
start_time, end_time = long_seq.index.min(), long_seq.index.max()
full_index = pd.date_range(start_time, end_time, freq='h')
reindex_seq = long_seq.reindex(full_index)

As the output shows, the complete index is a `DatetimeIndex` with dtype `datetime64`.

In [10]:
full_index

DatetimeIndex(['2017-04-01 00:00:00', '2017-04-01 01:00:00',
               '2017-04-01 02:00:00', '2017-04-01 03:00:00',
               '2017-04-01 04:00:00', '2017-04-01 05:00:00',
               '2017-04-01 06:00:00', '2017-04-01 07:00:00',
               '2017-04-01 08:00:00', '2017-04-01 09:00:00',
               ...
               '2017-04-30 14:00:00', '2017-04-30 15:00:00',
               '2017-04-30 16:00:00', '2017-04-30 17:00:00',
               '2017-04-30 18:00:00', '2017-04-30 19:00:00',
               '2017-04-30 20:00:00', '2017-04-30 21:00:00',
               '2017-04-30 22:00:00', '2017-04-30 23:00:00'],
              dtype='datetime64[ns]', length=720, freq='H')
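
As a small illustration of what `reindex` does with missing timestamps, the sketch below uses a three-point toy series with a gap. The names `toy_seq`, `toy_full_index`, and `toy_reindexed` and the values are made up for the example.

```python
import pandas as pd

# Toy series with a gap: 02:00 and 03:00 are missing (illustrative values)
toy_index = pd.to_datetime(['2017-04-01 00:00', '2017-04-01 01:00', '2017-04-01 04:00'])
toy_seq = pd.Series([6, 6, 4], index=toy_index)

# Complete hourly index between the first and last timestamp
toy_full_index = pd.date_range(toy_seq.index.min(), toy_seq.index.max(), freq='h')

# Reindexing inserts NaN at the timestamps that were absent
toy_reindexed = toy_seq.reindex(toy_full_index)
print(toy_reindexed)
```

Because NaN is a floating-point value, the integer series is converted to `float64` after reindexing.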

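### Fixed-time-span sliding window, padding and packing

A sliding window with a fixed time span produces more sensible sequences, because every window then covers the same amount of time even when some points are missing. In general, this relies on the timestamp index of a pandas DataFrame.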

In [6]:
window_size = 12  # hours
short_seqs = []
start_time, end_time = long_seq.index.min(), long_seq.index.max() - pd.Timedelta(window_size, 'h')
cur_time = start_time
while cur_time < end_time:
    short_seqs.append(long_seq.loc[cur_time:cur_time + pd.Timedelta(window_size-1, 'h')].tolist())
    cur_time += pd.Timedelta(1, 'h')

However, this makes the sequences unequal in length, so they cannot be stacked directly into a Tensor.

In [7]:
seq_lengths = [len(short_seq) for short_seq in short_seqs]
print('Minimum length:', min(seq_lengths))
print('Maximum length:', max(seq_lengths))

Minimum length: 3
Maximum length: 12
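
The effect is easier to see on a toy series. The sketch below slides a 3-hour window over five points with a two-hour gap. The data and the names `toy_seq`, `span_hours`, and `toy_lengths` are illustrative assumptions.

```python
import pandas as pd

# Five observations; 03:00 and 04:00 are missing (illustrative values)
toy_index = pd.to_datetime(['2017-04-01 00:00', '2017-04-01 01:00', '2017-04-01 02:00',
                            '2017-04-01 05:00', '2017-04-01 06:00'])
toy_seq = pd.Series([6, 6, 5, 4, 4], index=toy_index)

span_hours = 3
toy_lengths = []
cur_time = toy_seq.index.min()
last_start = toy_seq.index.max() - pd.Timedelta(span_hours - 1, 'h')
while cur_time <= last_start:
    # Label-based slicing includes both end points, hence "span_hours - 1"
    window = toy_seq.loc[cur_time:cur_time + pd.Timedelta(span_hours - 1, 'h')]
    toy_lengths.append(len(window))
    cur_time += pd.Timedelta(1, 'h')

print(toy_lengths)   # [3, 2, 1, 1, 2]
```

Every window spans the same three hours, but the number of observations it contains depends on how many timestamps are missing inside it.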


First, pad the sequences with `zip_longest` from Python's built-in `itertools` module so that every sequence is as long as the longest one.

In [8]:
from itertools import zip_longest

padded_seqs = np.array(list(zip_longest(*short_seqs, fillvalue=0))).transpose()
padded_seqs.shape

(659, 12)
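
Padding with 0 cannot be told apart from a genuine reading of 0, which can occur for temperature. As a minimal sketch of one way around this, the helper below returns a boolean mask alongside the padded array. `pad_with_mask` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

def pad_with_mask(seqs, fill_value=0.0):
    """Right-pad variable-length sequences and return (padded, mask, lengths)."""
    lengths = np.array([len(s) for s in seqs])
    max_len = lengths.max()
    padded = np.full((len(seqs), max_len), fill_value, dtype=float)
    for row, s in enumerate(seqs):
        padded[row, :len(s)] = s
    # mask[i, j] is True where position j of sequence i holds real data
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    return padded, mask, lengths

toy_seqs = [[6, 6, 5], [5, 4], [0]]   # the last sequence contains a real 0
toy_padded, toy_mask, toy_seq_lengths = pad_with_mask(toy_seqs)
print(toy_padded)
print(toy_mask)
```

The mask marks the single real 0 in the last row as valid, while the padded zeros are marked as invalid.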

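When working with PyTorch, `pack_padded_sequence` packs the sequences so that PyTorch handles the unequal lengths properly; that is, the padded positions are never actually fed to the model. Note that this function needs the length of every sequence to be passed in explicitly.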

In [9]:
import torch
from torch.nn.utils.rnn import pack_padded_sequence

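# Cast to float and add a trailing feature dimension so that the packed
# sequence can be fed directly to torch's built-in RNN, GRU and LSTM modules.
padded_tensor = torch.tensor(padded_seqs, dtype=torch.float32).unsqueeze(-1)
packed_seqs = pack_padded_sequence(padded_tensor, seq_lengths,
                                   batch_first=True, enforce_sorted=False)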
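
As a sketch of how a packed batch might then be consumed, the example below runs a toy batch through `nn.GRU` and unpacks the result with `pad_packed_sequence`. The toy values, the hidden size of 4, and the variable names are illustrative assumptions.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy padded batch with shape (batch, time, feature); zeros are padding
toy_batch = torch.tensor([[6., 6., 5.],
                          [5., 4., 0.],
                          [4., 0., 0.]]).unsqueeze(-1)
toy_batch_lengths = [3, 2, 1]

packed = pack_padded_sequence(toy_batch, toy_batch_lengths,
                              batch_first=True, enforce_sorted=False)

gru = nn.GRU(input_size=1, hidden_size=4, batch_first=True)
packed_out, h_n = gru(packed)

# Unpack to a regular padded tensor; padded positions come back as zeros
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, h_n.shape, out_lengths.tolist())
```

The final hidden state `h_n` holds, for each sequence, the output at its last real time step rather than at the last padded position.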

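### Sequence resampling

Instead of padding and packing, another approach is to fill in the timestamps that are missing from the original long sequence, as in the resampling step earlier. By default, the missing time points are filled with NaN.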

In [10]:
# Use the full time range here; no window-sized margin is needed for reindexing
start_time, end_time = long_seq.index.min(), long_seq.index.max()
full_index = pd.date_range(start_time, end_time, freq='h')
reindex_seq = long_seq.reindex(full_index)

We can then use an interpolation function to fill in the gaps from the existing data. Linear interpolation is a commonly used method.

In [11]:
inter_seq = reindex_seq.interpolate(method='linear', axis=0, limit=2, limit_direction='both')
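
The `limit=2, limit_direction='both'` arguments mean that at most two consecutive NaNs are filled from each side of a gap, so the middle of a long gap stays empty. The sketch below shows this on a toy series with a five-hour gap; the data and names are illustrative.

```python
import numpy as np
import pandas as pd

toy_hours = pd.date_range('2017-04-01 00:00', periods=7, freq='h')
# Known values at both ends, five missing points in between
toy_gap_seq = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 7.0],
                        index=toy_hours)

toy_filled = toy_gap_seq.interpolate(method='linear', axis=0,
                                     limit=2, limit_direction='both')
print(toy_filled)
```

Two points are filled forward from the left edge and two backward from the right edge, which leaves the middle point (03:00) as NaN. Gaps of four points or fewer would be filled completely with these settings.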

## 3. Object-Oriented Data Processing

In [12]:
import warnings
warnings.filterwarnings("ignore")

Instantiate a dataset class and inspect its preliminarily preprocessed data.

In [13]:
from dataset_example import KrakowDataset

data_obj = KrakowDataset()
data_obj

<dataset_example.KrakowDataset at 0x197318f6c70>

In [14]:
data_obj.data

Unnamed: 0,171_temperature,171_humidity,171_pressure,171_pm1,171_pm25,171_pm10
2017-01-01 00:00:00,0.254545,0.082803,0.731568,0.639130,0.592334,0.567627
2017-01-01 01:00:00,0.236364,0.082803,0.722620,0.560870,0.522648,0.505543
2017-01-01 02:00:00,0.254545,0.082803,0.708840,0.582609,0.547038,0.532151
2017-01-01 03:00:00,0.236364,0.082803,0.699893,0.626087,0.595819,0.578714
2017-01-01 04:00:00,0.236364,0.082803,0.691303,0.595652,0.564460,0.549889
...,...,...,...,...,...,...
2017-12-24 20:00:00,0.345455,0.471338,0.598067,0.056522,0.048780,0.053215
2017-12-24 21:00:00,0.345455,0.471338,0.604331,0.052174,0.045296,0.046563
2017-12-24 22:00:00,0.327273,0.452229,0.606478,0.043478,0.038328,0.042129
2017-12-24 23:00:00,0.327273,0.471338,0.607552,0.043478,0.038328,0.042129
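
The implementation of `KrakowDataset` is not shown in this notebook. As a rough sketch of the object-oriented pattern only, the hypothetical class below chains the steps from the earlier sections (reindexing, linear interpolation, fixed-length windows). The class name `ToySeriesDataset`, its parameters, and the min-max scaling step are assumptions made for illustration and do not describe the actual `dataset_example` module.

```python
import numpy as np
import pandas as pd

class ToySeriesDataset:
    """Hypothetical sketch; NOT the KrakowDataset from dataset_example."""

    def __init__(self, series, window_size=12, interp_limit=2):
        # Resample onto a complete hourly index, then fill short gaps
        full_index = pd.date_range(series.index.min(), series.index.max(), freq='h')
        data = series.reindex(full_index).interpolate(
            method='linear', limit=interp_limit, limit_direction='both')
        # Min-max scaling to [0, 1] (assumed preprocessing step)
        self.data = (data - data.min()) / (data.max() - data.min())
        self.window_size = window_size

    def __len__(self):
        # Number of fixed-length windows that fit in the series
        return max(len(self.data) - self.window_size + 1, 0)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.data.iloc[i:i + self.window_size].to_numpy()

# Toy input: 24 hourly points with one timestamp removed
raw = pd.Series(np.arange(24.0),
                index=pd.date_range('2017-01-01', periods=24, freq='h'))
raw = raw.drop(raw.index[5])

toy_dataset = ToySeriesDataset(raw)
print(len(toy_dataset), toy_dataset[0].shape)   # 13 (12,)
```

Implementing `__len__` and `__getitem__` follows the same protocol that `torch.utils.data.Dataset` expects, so a class of this shape could be adapted for use with a PyTorch `DataLoader`.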
