## 150 pandas Data Sampling (Part 4)

### resample: resamples time-series data, performing frequency conversion by upsampling or downsampling
#### Period indexes, MultiIndexes, fixed frequencies, upsampling and downsampling
Parameters: rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None
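Several of these parameters (`closed`, `label`, `offset`) are not exercised in the cells that follow. The sketch below is a minimal illustration of them on the same nine-point, one-minute series; it uses the `'min'` alias rather than `'T'` on the assumption that a recent pandas release is installed.

```python
import pandas as pd

# Nine one-minute observations: 00:00 .. 00:08 with values 0 .. 8.
index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

# Default for minute-like rules: bins are closed and labelled on the left edge.
left = series.resample('3min').sum()

# Close and label each bin on its right edge instead.
right = series.resample('3min', closed='right', label='right').sum()

# Shift every bin edge by one minute with `offset`.
shifted = series.resample('3min', offset='1min').sum()

print(left.tolist())      # [3, 12, 21]
print(right.tolist())     # [0, 6, 15, 15]
print(shifted.tolist())   # [0, 6, 15, 15]
print(shifted.index[0])   # 1999-12-31 23:58:00
```

`right` and `shifted` happen to hold the same sums here, but their labels differ: `right` is labelled 00:00, 00:03, 00:06, 00:09, while `shifted` starts at 23:58 on the previous day.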

In [1]:
import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=9, freq='T')
series = pd.Series(range(9), index=index)

s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
                                            freq='A',
                                            periods=2))

q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
                                                  freq='Q',
                                                  periods=4))

d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
     'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018',
                                    periods=8,
                                    freq='W')

days = pd.date_range('1/1/2000', periods=4, freq='D')
d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
df2 = pd.DataFrame(
    d2,
    index=pd.MultiIndex.from_product(
        [days, ['morning', 'afternoon']]
    )
)

In [2]:
series

2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

In [31]:
series[::3] # plain slicing, shown for comparison

2000-01-01 00:00:00    0
2000-01-01 00:03:00    3
2000-01-01 00:06:00    6
Freq: 3T, dtype: int64

In [4]:
series.resample('3T').asfreq()

2000-01-01 00:00:00    0
2000-01-01 00:03:00    3
2000-01-01 00:06:00    6
Freq: 3T, dtype: int64
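`asfreq()` only keeps the value sitting at each new label, which is why it matches the slice above. When downsampling, an aggregation is usually wanted instead. A minimal sketch, rebuilding the same series with the `'min'` alias (an assumption about the installed pandas version):

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

picked = series.resample('3min').asfreq()   # value at each bin label only
summed = series.resample('3min').sum()      # sum of all values in each bin
averaged = series.resample('3min').mean()   # mean of all values in each bin

print(picked.tolist())    # [0, 3, 6]
print(summed.tolist())    # [3, 12, 21]
print(averaged.tolist())  # [1.0, 4.0, 7.0]
```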

In [5]:
series.resample('.5T').asfreq() # upsampling: the new 30-second slots have no data, so they are NaN

2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    NaN
2000-01-01 00:03:00    3.0
2000-01-01 00:03:30    NaN
2000-01-01 00:04:00    4.0
2000-01-01 00:04:30    NaN
2000-01-01 00:05:00    5.0
2000-01-01 00:05:30    NaN
2000-01-01 00:06:00    6.0
2000-01-01 00:06:30    NaN
2000-01-01 00:07:00    7.0
2000-01-01 00:07:30    NaN
2000-01-01 00:08:00    8.0
Freq: 30S, dtype: float64
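The NaN slots created by upsampling can be filled rather than left empty. The following sketch shows two common choices, forward fill and linear interpolation; the `'30s'` and `'min'` aliases are assumptions about the installed pandas version.

```python
import pandas as pd

index = pd.date_range('2000-01-01', periods=9, freq='min')
series = pd.Series(range(9), index=index)

# Repeat the last known value in each new 30-second slot.
forward = series.resample('30s').ffill()

# Draw a straight line between neighbouring known values.
linear = series.resample('30s').interpolate()

print(len(forward))              # 17
print(forward.head(4).tolist())  # [0, 0, 1, 1]
print(linear.head(4).tolist())   # [0.0, 0.5, 1.0, 1.5]
```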

In [6]:
q

2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64

In [9]:
q.resample('M').sum()

2018-01    1.0
2018-02    NaN
2018-03    NaN
2018-04    2.0
2018-05    NaN
2018-06    NaN
2018-07    3.0
2018-08    NaN
2018-09    NaN
2018-10    4.0
2018-11    NaN
2018-12    NaN
Freq: M, dtype: float64

In [11]:
q.resample('M',convention='start').sum() # convention: place each value at the start or the end of its original period

2018-01    1.0
2018-02    NaN
2018-03    NaN
2018-04    2.0
2018-05    NaN
2018-06    NaN
2018-07    3.0
2018-08    NaN
2018-09    NaN
2018-10    4.0
2018-11    NaN
2018-12    NaN
Freq: M, dtype: float64

In [13]:
q.resample('M',convention='end').sum() # convention: place each value at the start or the end of its original period

2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64
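One way to see where each quarterly value lands is to convert the period index directly with `asfreq` and its `how` argument. This is offered as a mental model for `convention`, not as a description of how `resample` is implemented; note that `asfreq` on a period index only relabels the four existing rows and does not insert the empty months.

```python
import pandas as pd

q = pd.Series([1, 2, 3, 4],
              index=pd.period_range('2018-01-01', freq='Q', periods=4))

# Map each quarter to its first or last month.
at_start = q.asfreq('M', how='start')
at_end = q.asfreq('M', how='end')

print(at_start.index.astype(str).tolist())
# ['2018-01', '2018-04', '2018-07', '2018-10']
print(at_end.index.astype(str).tolist())
# ['2018-03', '2018-06', '2018-09', '2018-12']
```

These are exactly the months that carry non-NaN values in the two `resample` outputs above.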

In [14]:
s

2012    1
2013    2
Freq: A-DEC, dtype: int64

In [15]:
s.resample('Q',convention='end').sum() # convention: place each value at the start or the end of its original period

2012Q4    1.0
2013Q1    NaN
2013Q2    NaN
2013Q3    NaN
2013Q4    2.0
Freq: Q-DEC, dtype: float64

In [16]:
s.resample('Q',convention='s').sum() # 's' is shorthand for 'start'

2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

In [17]:
s.resample('Q',convention='s',kind='timestamp').sum() # kind: index type of the result, 'period' or 'timestamp'

2012-03-31    1
2012-06-30    0
2012-09-30    0
2012-12-31    0
2013-03-31    2
Freq: Q-DEC, dtype: int64

In [18]:
s.resample('Q',convention='s',kind='period').sum() # kind: index type of the result, 'period' or 'timestamp'

2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64
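The same choice between a period index and a timestamp index can also be made by converting the index explicitly, without `kind`. The sketch below assumes the `'Y'` alias is accepted for yearly periods in the installed pandas version; to my knowledge `kind` is marked deprecated in recent releases, which makes the explicit conversion the safer habit.

```python
import pandas as pd

s = pd.Series([1, 2],
              index=pd.period_range('2012-01-01', freq='Y', periods=2))

# PeriodIndex -> DatetimeIndex, anchored at the start of each year.
as_timestamps = s.to_timestamp(how='start')

# DatetimeIndex -> PeriodIndex again.
back_to_periods = as_timestamps.to_period('Y')

print(type(as_timestamps.index).__name__)    # DatetimeIndex
print(type(back_to_periods.index).__name__)  # PeriodIndex
```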

In [19]:
df2

| | | price | volume |
|---|---|---|---|
| 2000-01-01 | morning | 10 | 50 |
| 2000-01-01 | afternoon | 11 | 60 |
| 2000-01-02 | morning | 9 | 40 |
| 2000-01-02 | afternoon | 13 | 100 |
| 2000-01-03 | morning | 14 | 50 |
| 2000-01-03 | afternoon | 18 | 100 |
| 2000-01-04 | morning | 17 | 40 |
| 2000-01-04 | afternoon | 19 | 50 |


In [27]:
df2.resample('D',level=0).sum() # level: which level of a MultiIndex to resample on

| | price | volume |
|---|---|---|
| 2000-01-01 | 21 | 110 |
| 2000-01-02 | 22 | 140 |
| 2000-01-03 | 32 | 150 |
| 2000-01-04 | 36 | 90 |
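Resampling on a level collapses the other level of the MultiIndex. A minimal sketch that checks this against dropping the second level first, using a two-day rule so the bins actually merge days (the `'2D'` rule is chosen here for illustration only):

```python
import pandas as pd

days = pd.date_range('2000-01-01', periods=4, freq='D')
df2 = pd.DataFrame(
    {'price': [10, 11, 9, 13, 14, 18, 17, 19],
     'volume': [50, 60, 40, 100, 50, 100, 40, 50]},
    index=pd.MultiIndex.from_product([days, ['morning', 'afternoon']]),
)

# Resample directly on the datetime level of the MultiIndex.
via_level = df2.resample('2D', level=0).sum()

# Equivalent route: drop the morning/afternoon level, then resample.
via_flat = df2.reset_index(level=1, drop=True).resample('2D').sum()

print(via_level['price'].tolist())   # [43, 68]
print(via_level['volume'].tolist())  # [250, 240]
```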


In [29]:
df2.resample('D',level=0,kind='period').sum().index # kind: index type of the result, 'period' or 'timestamp'

PeriodIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='period[D]')

In [30]:
df2.resample('D',level=0,kind='timestamp').sum().index # kind: index type of the result, 'period' or 'timestamp'

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04'], dtype='datetime64[ns]', freq='D')
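The `df` frame built in the first cell is not used in any cell above; it fits the `on` parameter, which resamples on a datetime column instead of the index. A minimal sketch, with a `'14D'` rule chosen here only to keep the bins easy to verify by hand:

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 11, 9, 13, 14, 18, 17, 19],
                   'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
df['week_starting'] = pd.date_range('2018-01-01', periods=8, freq='W')

# Resample on the datetime column, leaving the integer index alone.
by_column = df.resample('14D', on='week_starting').mean()

# Equivalent route: move the column into the index first.
by_index = df.set_index('week_starting').resample('14D').mean()

print(by_column['price'].tolist())   # [10.5, 11.0, 16.0, 18.0]
print(by_column['volume'].tolist())  # [55.0, 70.0, 75.0, 45.0]
```

Each 14-day bin holds two of the weekly rows, so the means are simple pairwise averages.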