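# 11.6 Resampling and Frequency Conversion

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower-frequency data to a higher frequency is called upsampling. Not all resampling falls into either of these categories. For example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has an API similar to groupby: you call resample to group the data, then call an aggregation function: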

In [1]:
from datetime import datetime

import pandas as pd
import numpy as np

In [2]:
rng = pd.date_range('2000-01-01', periods=100, freq='D')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts

2000-01-01    0.375103
2000-01-02    1.071034
2000-01-03    0.647248
2000-01-04    0.488860
2000-01-05    0.432354
                ...   
2000-04-05   -0.139699
2000-04-06   -0.705903
2000-04-07   -0.522485
2000-04-08   -1.385005
2000-04-09    0.805760
Freq: D, Length: 100, dtype: float64

In [3]:
ts.resample('M').mean()

2000-01-31   -0.261068
2000-02-29    0.143114
2000-03-31    0.175814
2000-04-30   -0.554309
Freq: M, dtype: float64

In [4]:
ts.resample('M', kind='period').mean()

2000-01   -0.261068
2000-02    0.143114
2000-03    0.175814
2000-04   -0.554309
Freq: M, dtype: float64
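
To make the groupby analogy concrete, the following sketch computes monthly means twice, once with resample and once with an explicit groupby on the calendar month. It uses deterministic data (0 through 99) rather than the random values above, and it passes a `pd.offsets.MonthEnd()` object instead of a string alias; both choices are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100, dtype=float), index=rng)

# Resample into month-end bins and take the mean of each bin.
by_resample = ts.resample(pd.offsets.MonthEnd()).mean()

# The same grouping written out by hand: one group per calendar month.
by_groupby = ts.groupby(ts.index.to_period('M')).mean()

print(by_resample)
print(np.allclose(by_resample.to_numpy(), by_groupby.to_numpy()))
```

The two results hold the same values; they differ only in how the groups are labeled (month-end timestamps versus monthly periods).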

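resample is a flexible and high-performance method that can be used to process very large time series. I'll illustrate its usage through a series of examples. Table 11-5 summarizes some of its options.

Table 11-5. resample method arguments

## Downsampling

Aggregating data to a regular, lower frequency is a very common time series task. The data you're aggregating doesn't need to have a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is half-open: a data point can belong to only one interval, and the union of the intervals must make up the whole time frame. There are two things to think about when using resample to downsample data:

- Which side of each interval is closed.
- How to label each aggregated bin, either with the start of the interval or the end.

To illustrate, let's look at some one-minute data: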

In [7]:
rng = pd.date_range('2000-01-01', periods=12, freq='T')

rng

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00', '2000-01-01 00:03:00',
               '2000-01-01 00:04:00', '2000-01-01 00:05:00',
               '2000-01-01 00:06:00', '2000-01-01 00:07:00',
               '2000-01-01 00:08:00', '2000-01-01 00:09:00',
               '2000-01-01 00:10:00', '2000-01-01 00:11:00'],
              dtype='datetime64[ns]', freq='T')

In [10]:
ts = pd.Series(np.arange(12), index=rng)

ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

Suppose you wanted to aggregate this data into five-minute chunks by taking the sum of each group:

In [11]:
ts.resample('5min', closed='right').sum()

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int64

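The frequency you pass defines bin edges in five-minute increments. With closed='right', the right bin edge is inclusive, so the 00:00 to 00:05 interval includes the 00:05 value. Note that for a frequency such as '5min', current pandas versions default to closed='left' (only 'M', 'A', 'Q', 'BM', 'BA', 'BQ', and 'W' default to 'right'), which is why closed='right' is passed explicitly here. Passing closed='left' closes each interval on its left edge instead.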
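
As a minimal sketch of the difference between the two settings, the following rebuilds the one-minute series and sums it into five-minute bins both ways. The 'min' alias is used in place of 'T' here only because it is accepted across pandas versions.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Left-closed bins: [00:00, 00:05), [00:05, 00:10), [00:10, 00:15)
left_closed = ts.resample('5min', closed='left').sum()

# Right-closed bins: (23:55, 00:00], (00:00, 00:05], (00:05, 00:10], ...
right_closed = ts.resample('5min', closed='right').sum()

print(left_closed.tolist())   # [10, 35, 21]
print(right_closed.tolist())  # [0, 15, 40, 11]
```

With right-closed bins the 00:00 observation falls into a bin of its own that begins at 23:55 on the previous day, which is why that result has one more row.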

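As you can see, the resulting time series is labeled by the timestamps from the left side of each bin. By passing label='right' you can label them with the right bin edge instead: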

In [13]:
ts.resample('5min', closed='right', label='right').sum()

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int64

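Figure 11-3 illustrates one-minute data being resampled to five-minute data.

Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge to make it clearer which interval the timestamp refers to. To do this, pass a string or date offset to loffset (newer pandas releases deprecate this argument in favor of adjusting the result's index directly):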

In [14]:
ts.resample('5min', 
            closed='right',
            label='right', 
            loffset='-1s'
           ).sum()



1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int64

You could also have accomplished the same thing by calling the shift method on the result, without using loffset.
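
The sketch below shows two ways to get the one-second adjustment without loffset: shifting the result with a frequency, and subtracting a Timedelta from the index. The variable names are illustrative, and `pd.offsets.Second(1)` is just one of several ways to express a one-second offset.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

result = ts.resample('5min', closed='right', label='right').sum()

# Option 1: shift the index labels back by one second; the values stay put
# because a freq is given.
shifted = result.shift(-1, freq=pd.offsets.Second(1))

# Option 2: subtract a Timedelta from the index directly.
adjusted = result.copy()
adjusted.index = adjusted.index - pd.Timedelta(seconds=1)

print(shifted)
```

Both variants leave the aggregated values untouched and only move the labels, so the first label becomes 1999-12-31 23:59:59.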

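## Open-High-Low-Close (OHLC) Resampling

In finance, a ubiquitous way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. Calling the ohlc aggregation method gives you a DataFrame whose columns contain these four aggregates, which are computed efficiently in a single sweep of the data: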

In [17]:
ts.resample('5min').ohlc()

                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11
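
To spell out what the four columns contain, this sketch compares ohlc with an explicit aggregation over 'first', 'max', 'min', and 'last'. It is meant only as an illustration of the definition, not as a description of how pandas computes the result internally.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample('5min').ohlc()

# The same four statistics requested one by one, then renamed to match.
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print((ohlc.to_numpy() == manual.to_numpy()).all())
```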


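## Upsampling and Interpolation

When converting from a low frequency to a higher frequency, no aggregation is needed. Let's consider a DataFrame with some weekly data: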

In [18]:
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', 
                                         periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio']
                    )

frame

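            Colorado     Texas  New York      Ohio
2000-01-05  0.589634  0.957699  1.535916 -0.854045
2000-01-12 -0.098657  1.315946 -0.717368  0.168471


When you use an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation: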

In [20]:
df_daily = frame.resample('D').asfreq()

df_daily

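            Colorado     Texas  New York      Ohio
2000-01-05  0.589634  0.957699  1.535916 -0.854045
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.098657  1.315946 -0.717368  0.168471


Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling and interpolation methods available in fillna and reindex are available for resampling: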

In [21]:
frame.resample('D').ffill()

            Colorado     Texas  New York      Ohio
2000-01-05  0.589634  0.957699  1.535916 -0.854045
2000-01-06  0.589634  0.957699  1.535916 -0.854045
2000-01-07  0.589634  0.957699  1.535916 -0.854045
2000-01-08  0.589634  0.957699  1.535916 -0.854045
2000-01-09  0.589634  0.957699  1.535916 -0.854045
2000-01-10  0.589634  0.957699  1.535916 -0.854045
2000-01-11  0.589634  0.957699  1.535916 -0.854045
2000-01-12 -0.098657  1.315946 -0.717368  0.168471


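Similarly, you can choose to fill only a certain number of periods forward, which limits how far an observed value continues to be used; the ffill method accepts a limit argument for this.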
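
A small sketch of limited forward filling follows. The frame uses fixed values (0.0 through 7.0) in place of the random data above so the effect is easy to read, and the choice of limit=2 is arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# Carry each Wednesday's values forward for at most two days.
limited = frame.resample('D').ffill(limit=2)

print(limited)
```

Thursday and Friday receive Wednesday's values, while the four days from Saturday through Tuesday stay missing.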

Note that the new date index need not overlap with the old one at all:

In [22]:
frame.resample('W-THU').ffill()

            Colorado     Texas  New York      Ohio
2000-01-06  0.589634  0.957699  1.535916 -0.854045
2000-01-13 -0.098657  1.315946 -0.717368  0.168471
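
Forward filling is not the only option when upsampling. As a sketch of interpolation, the example below fills the days between two weekly observations linearly with the resampler's interpolate method. The two-column frame and its values are made up for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Colorado': [0.0, 7.0], 'Texas': [10.0, 3.0]},
                     index=pd.date_range('2000-01-05', periods=2, freq='W-WED'))

# Upsample to daily frequency and interpolate linearly between the
# two Wednesdays instead of repeating the earlier value.
daily = frame.resample('D').interpolate()

print(daily)
```

Colorado rises by one per day from 0.0 to 7.0 and Texas falls by one per day from 10.0 to 3.0.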


## Resampling with Periods

Resampling data indexed by periods is similar to resampling data indexed by timestamps:

In [25]:
frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio']
                    )

frame[:10]

         Colorado     Texas  New York      Ohio
2000-01 -1.122430  0.137510  0.480341 -0.382167
2000-02 -0.141096  0.853050 -2.304634  1.196920
2000-03 -1.578841 -0.763457 -1.477681 -0.336945
2000-04 -0.157981  0.735355 -0.992008  0.011157
2000-05  0.832767 -0.741682 -0.209806  2.358965
2000-06 -0.067782 -0.359067  0.464881 -0.004299
2000-07  0.896280  0.483854 -0.535428 -1.240675
2000-08 -1.028906  0.974636 -0.177606 -0.925124
2000-09 -0.958059  0.353401  0.558435  0.018048
2000-10  2.065366  1.247460 -0.249530 -1.013787


In [27]:
annual_frame = frame.resample('A-DEC').mean()

annual_frame

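      Colorado     Texas  New York      Ohio
2000 -0.166971  0.295861 -0.384832  0.060514
2001 -0.613921 -0.039024 -0.043503 -0.215634


Upsampling is more nuanced, because you must decide which end of the time span in the new frequency to place the values in before resampling, just as with the asfreq method. The convention argument defaults to 'start' but can also be set to 'end':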

In [28]:
# Q-DEC: Quarterly, year ending in December

annual_frame.resample('Q-DEC').ffill()

        Colorado     Texas  New York      Ohio
2000Q1 -0.166971  0.295861 -0.384832  0.060514
2000Q2 -0.166971  0.295861 -0.384832  0.060514
2000Q3 -0.166971  0.295861 -0.384832  0.060514
2000Q4 -0.166971  0.295861 -0.384832  0.060514
2001Q1 -0.613921 -0.039024 -0.043503 -0.215634
2001Q2 -0.613921 -0.039024 -0.043503 -0.215634
2001Q3 -0.613921 -0.039024 -0.043503 -0.215634
2001Q4 -0.613921 -0.039024 -0.043503 -0.215634


In [29]:
annual_frame.resample('Q-DEC', convention='end').ffill()

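        Colorado     Texas  New York      Ohio
2000Q4 -0.166971  0.295861 -0.384832  0.060514
2001Q1 -0.166971  0.295861 -0.384832  0.060514
2001Q2 -0.166971  0.295861 -0.384832  0.060514
2001Q3 -0.166971  0.295861 -0.384832  0.060514
2001Q4 -0.613921 -0.039024 -0.043503 -0.215634


Since periods refer to time spans, the rules about upsampling and downsampling are more rigid:

- In downsampling, the target frequency must be a subperiod of the source frequency.
- In upsampling, the target frequency must be a superperiod of the source frequency.

If these rules are not satisfied, an exception is raised. This mainly affects the quarterly, annual, and weekly frequencies. For example, the time spans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC: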

In [31]:
annual_frame.resample('Q-MAR').ffill()

        Colorado     Texas  New York      Ohio
2000Q4 -0.166971  0.295861 -0.384832  0.060514
2001Q1 -0.166971  0.295861 -0.384832  0.060514
2001Q2 -0.166971  0.295861 -0.384832  0.060514
2001Q3 -0.166971  0.295861 -0.384832  0.060514
2001Q4 -0.613921 -0.039024 -0.043503 -0.215634
2002Q1 -0.613921 -0.039024 -0.043503 -0.215634
2002Q2 -0.613921 -0.039024 -0.043503 -0.215634
2002Q3 -0.613921 -0.039024 -0.043503 -0.215634
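
The quarter labels in these outputs can be checked with Period.asfreq, which maps a single period to the start or end of its span in another frequency. The sketch below does this for the calendar year 2000; it uses 'Y' as the annual alias (an assumption chosen for compatibility across pandas versions) in place of the 'A-DEC' alias used above.

```python
import pandas as pd

# Calendar year 2000, i.e. an annual period ending in December.
year = pd.Period('2000', freq='Y')

# Quarters of a fiscal year that also ends in December.
dec_start = str(year.asfreq('Q-DEC', how='start'))
dec_end = str(year.asfreq('Q-DEC', how='end'))

# Quarters of a fiscal year ending in March: January 2000 belongs to
# fiscal 2000, while December 2000 already belongs to fiscal 2001.
mar_start = str(year.asfreq('Q-MAR', how='start'))
mar_end = str(year.asfreq('Q-MAR', how='end'))

print(dec_start, dec_end)  # 2000Q1 2000Q4
print(mar_start, mar_end)  # 2000Q4 2001Q3
```

This matches the Q-MAR output, where the value for 2000 covers 2000Q4 through 2001Q3.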
