# Chapter 9: Time Series Data

In [1]:
import pandas as pd
import numpy as np

## I. Creating Time Series
### 1. Four kinds of time variables
#### Items ③ and ④ may be confusing at first; they are explained later in the chapter

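Name | Description | Element type | Created with  
:-|:-|:-|:-
① Date times (points in time) | A specific date or moment | Timestamp | to_datetime or date_range
② Time spans (periods) | An interval defined by points in time | Period | Period or period_range
③ Date offsets (relative durations) | A duration that follows calendar arithmetic, so the wall-clock time is preserved across daylight-saving (DST) changes | DateOffset | DateOffset
④ Time deltas (absolute durations) | A fixed amount of elapsed time, so the wall-clock time can shift across DST changes | Timedelta | to_timedelta or timedelta_range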
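
As a rough illustration of the table, the four kinds of objects can be constructed directly. The dates and durations below are arbitrary values chosen for this sketch.

```python
import pandas as pd

ts = pd.Timestamp('2020-01-01')        # ① a point in time
p = pd.Period('2020-01', freq='M')     # ② a span: the whole of January 2020
off = pd.DateOffset(months=1)          # ③ a calendar-aware (relative) duration
td = pd.Timedelta(days=1)              # ④ a fixed (absolute) duration

print(p.start_time, p.end_time)  # the Timestamps that bound the period
print(ts + off)                  # one calendar month later
print(ts + td)                   # exactly 24 hours later
```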

### 2. Creating points in time

#### (a) The to_datetime method
#### pandas is very flexible about the input format; every statement below creates the same Timestamp

In [2]:
pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')

Timestamp('2020-01-01 00:00:00')

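#### The following formats are not parsed reliably: the statements commented out below raise errors, although the backslash form happened to parse in the run recorded here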

In [3]:
pd.to_datetime('2020\\1\\1')

Timestamp('2020-01-01 00:00:00')

In [4]:
# pd.to_datetime('2020`1`1')
# pd.to_datetime('2020.1 1')
# pd.to_datetime('1 1.2020')

#### In such cases the format parameter forces an explicit match

In [5]:
pd.to_datetime('2020\\1\\1', format='%Y\\%m\\%d')
pd.to_datetime('2020`1`1', format='%Y`%m`%d')
pd.to_datetime('2020.1 1', format='%Y.%m %d')
pd.to_datetime('1 1.2020', format='%d %m.%Y')

Timestamp('2020-01-01 00:00:00')
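
A minimal sketch of the two usual ways to deal with awkward inputs: an explicit `format`, and `errors='coerce'` for entries that cannot be parsed at all. The sample strings are made up for illustration.

```python
import pandas as pd

# An explicit format handles separators that automatic parsing does not recognise
ts = pd.to_datetime('2020`1`1', format='%Y`%m`%d')
print(ts)

# errors='coerce' turns unparseable entries into NaT instead of raising
mixed = pd.to_datetime(['2020-01-01', 'not a date'], errors='coerce')
print(mixed)
```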

#### Passing a list produces a DatetimeIndex
which can serve as the index of a Series, giving a time series

In [8]:
# Timestamps used as index labels
pd.Series(range(2), index=pd.to_datetime(['2020/1/1', '2020/1/2']))

2020-01-01    0
2020-01-02    1
dtype: int64

In [9]:
# The list is converted to a DatetimeIndex
type(pd.to_datetime(['2020/1/1', '2020/1/2']))

pandas.core.indexes.datetimes.DatetimeIndex

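#### For a DataFrame, to_datetime assembles datetimes automatically when the columns are named year, month, day (optionally hour, minute, second, and so on); the column order does not matter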

In [10]:
df = pd.DataFrame({'year': [2020, 2020], 'month': [1, 1], 'day': [1, 2]})
df

Unnamed: 0,year,month,day
0,2020,1,1
1,2020,1,2


In [11]:
pd.to_datetime(df)

0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]
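
The assembly is driven by column names. The sketch below, with made-up values, shuffles the columns and adds an optional `hour` column to show this.

```python
import pandas as pd

# Columns are matched by name, so their order is irrelevant;
# an optional 'hour' column is picked up as well
parts = pd.DataFrame({'day': [1, 2], 'hour': [8, 20],
                      'year': [2020, 2020], 'month': [1, 1]})
assembled = pd.to_datetime(parts)
print(assembled)
```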

#### (b) Precision and range limits
#### A Timestamp is far more precise than a day: its resolution goes down to the nanosecond (ns)

In [8]:
pd.to_datetime('2020/1/1 00:00:00.123456789')

Timestamp('2020-01-01 00:00:00.123456789')

#### The price of this precision is range: only about 584 years of time points are representable

In [21]:
pd.Timestamp.min

Timestamp('1677-09-21 00:12:43.145225')

In [22]:
pd.Timestamp.max

Timestamp('2262-04-11 23:47:16.854775807')
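
The figure of about 584 years can be checked with simple arithmetic, assuming the default nanosecond resolution in which a timestamp is a signed 64-bit count of nanoseconds.

```python
# With 64 bits there are 2**64 distinct nanosecond values in total
NS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

span_years = 2**64 / NS_PER_SECOND / SECONDS_PER_YEAR
print(round(span_years, 1))
```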

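#### (c) The date_range method
#### In general, start/end/periods (the number of time points)/freq (the spacing) are the most important parameters of this method; once three of them are given, the remaining one is determined

In [12]:
# periods is the number of evenly spaced time points to generate between start and end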
pd.date_range(start='2020/1/1', end='2020/1/10', periods=3)

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)

In [13]:
pd.date_range(start='2020/1/1', end='2020/1/10', freq='D')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')

In [31]:
# Count forward from start
pd.date_range(start='2020/1/1', periods=3, freq='D')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

In [33]:
# Count backward from end; freq sets the spacing (here one day)
pd.date_range(end='2020/1/3', periods=3, freq='D')

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

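#### freq accepts many aliases; the common ones are listed below, and more can be found [here](https://pandas.pydata.org/docs/user_guide/timeseries.html#offset-aliases). Newer pandas releases (2.2 and later) deprecate several of these spellings in favor of ME/QE/YE, BME/BQE/BYE, h, min and s

Alias | D/B | W | M/Q/Y | BM/BQ/BY | MS/QS/YS | BMS/BQS/BYS | H | T | S
:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
Description | Day/business day | Week | Month/quarter/year end | Business month/quarter/year end | Month/quarter/year start | Business month/quarter/year start | Hour | Minute | Second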

In [15]:
pd.date_range(start='2020/1/1', periods=3, freq='T')

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')

In [16]:
pd.date_range(start='2020/1/1', periods=3, freq='M')

DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='M')

In [17]:
pd.date_range(start='2020/1/1', periods=3, freq='BYS')

DatetimeIndex(['2020-01-01', '2021-01-01', '2022-01-03'], dtype='datetime64[ns]', freq='BAS-JAN')

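#### bdate_range is similar to date_range, but on top of the built-in business-day frequency it also accepts the weekmask and holidays parameters
#### Its freq has the special options 'C'/'CBM'/'CBMS' (custom), which are used together with weekmask and holidays
#### For example, keep only Monday, Tuesday and Friday as working days and exclude some holidays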

In [18]:
weekmask = 'Mon Tue Fri'
holidays = [pd.Timestamp('2020/1/%s' % i) for i in range(7, 13)]
# Note: holidays covers 2020-01-07 through 2020-01-12
pd.bdate_range(start='2020-1-1', end='2020-1-15', freq='C',
               weekmask=weekmask, holidays=holidays)

DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')
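
To see where the four dates come from, the same selection can be reproduced by hand. This is only a cross-check sketch: it filters a plain daily range by weekday and removes the holidays.

```python
import pandas as pd

holidays = [pd.Timestamp(2020, 1, d) for d in range(7, 13)]
custom = pd.bdate_range(start='2020-01-01', end='2020-01-15', freq='C',
                        weekmask='Mon Tue Fri', holidays=holidays)

# The same selection done by hand: keep Mon(0)/Tue(1)/Fri(4), drop the holidays
days = pd.date_range('2020-01-01', '2020-01-15')
manual = days[days.dayofweek.isin([0, 1, 4]) & ~days.isin(holidays)]

print(list(custom.strftime('%a %Y-%m-%d')))
print(list(manual.strftime('%a %Y-%m-%d')))
```

Tuesday the 7th and Friday the 10th fall inside the holiday range, which is why they are missing from the result.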

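### 3. DateOffset objects

#### (a) The difference between DateOffset and Timedelta
#### Timedelta is an absolute duration: whether under winter or summer time, adding or subtracting 1 day always means exactly 24 hours
#### DateOffset is a relative duration: whether the day has 23, 24 or 25 hours, adding or subtracting 1 day lands on the same wall-clock time
#### For example, summer time began on 2020-03-29: in the UK the clocks moved forward one hour from 01:00:00 to 02:00:00 local time. The code below uses the Europe/Helsinki zone, which switched on the same day (03:00:00 to 04:00:00 local time), so that day had only 23 hours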

In [19]:
ts = pd.Timestamp('2020-3-29 01:00:00', tz='Europe/Helsinki')
ts + pd.Timedelta(days=1)

Timestamp('2020-03-30 02:00:00+0300', tz='Europe/Helsinki')

In [20]:
ts + pd.DateOffset(days=1)

Timestamp('2020-03-30 01:00:00+0300', tz='Europe/Helsinki')

#### This can look confusing, but once tz (the time zone) is dropped the two behave identically, so the distinction only matters when time zones are involved

In [21]:
ts = pd.Timestamp('2020-3-29 01:00:00')
ts + pd.Timedelta(days=1)

Timestamp('2020-03-30 01:00:00')

In [22]:
ts + pd.DateOffset(days=1)

Timestamp('2020-03-30 01:00:00')
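
The same contrast can be sketched with the UK zone mentioned in the text. The starting time (noon on the day before the switch) is an arbitrary choice for this example.

```python
import pandas as pd

# Noon on the day before British Summer Time began (2020-03-29)
ts = pd.Timestamp('2020-03-28 12:00:00', tz='Europe/London')

absolute = ts + pd.Timedelta(days=1)    # exactly 24 elapsed hours
relative = ts + pd.DateOffset(days=1)   # same wall-clock time on the next day

print(absolute)
print(relative)
print(relative - ts)  # elapsed time between the two wall-clock noons
```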

#### (b) Adding and subtracting a duration
#### The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds

In [23]:
pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)

Timestamp('2019-12-18 00:20:00')
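
Calendar arithmetic has a few consequences worth seeing once. The following sketch uses an arbitrary end-of-month date to contrast `DateOffset` with `Timedelta`, and the plural keywords (which add) with the singular ones (which replace a field).

```python
import pandas as pd

start = pd.Timestamp('2020-01-31')

# Adding a month to January 31 clips to the last valid day of February
plus_month = start + pd.DateOffset(months=1)

# A Timedelta of 31 days, by contrast, always moves exactly 31 days
plus_31_days = start + pd.Timedelta(days=31)

# Singular keywords (day=, month=, ...) replace a field instead of adding to it
first_of_month = start + pd.DateOffset(day=1)

print(plus_month, plus_31_days, first_of_month, sep='\n')
```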

#### (c) Commonly used offset objects

freq | D/B | W | (B)M/(B)Q/(B)Y | (B)MS/(B)QS/(B)YS | H | T | S | C |
:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
offset | DateOffset/BDay | Week | (B)MonthEnd/(B)QuarterEnd/(B)YearEnd | (B)MonthBegin/(B)QuarterBegin/(B)YearBegin | Hour | Minute | Second | CDay (custom business day)

In [24]:
pd.Timestamp('2020-01-01') + pd.offsets.Week(2)

Timestamp('2020-01-15 00:00:00')

In [25]:
pd.Timestamp('2020-01-01') + pd.offsets.BQuarterBegin(1)

Timestamp('2020-03-02 00:00:00')

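#### (d) Applying offsets to a sequence
#### Using the offset's apply method (it was removed in pandas 2.0; on newer versions write i + offset instead)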

In [26]:
pd.Series(pd.offsets.BYearBegin(3).apply(i)
          for i in pd.date_range('20200101', periods=3, freq='Y'))

0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]

#### Adding or subtracting the offset object directly

In [27]:
pd.date_range('20200101', periods=3, freq='Y') + pd.offsets.BYearBegin(3)

DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')

#### A custom offset accepts the weekmask and holidays parameters (exercise: why are all three results the same?)

In [28]:
pd.Series(pd.offsets.CDay(3, weekmask='Wed Fri', holidays='2020010').apply(i)
          for i in pd.date_range('20200105', periods=3, freq='D'))

0   2020-01-15
1   2020-01-15
2   2020-01-15
dtype: datetime64[ns]
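
One way to see why the three results coincide is to print the weekday of each starting date. The sketch below leaves out the holidays argument and uses `+` instead of `apply`; both are simplifications made for this example.

```python
import pandas as pd

offset = pd.offsets.CDay(3, weekmask='Wed Fri')

results = []
for start in pd.date_range('2020-01-05', periods=3, freq='D'):
    # 2020-01-05/06/07 are Sun/Mon/Tue, none of which is a valid day here,
    # so each start counts the same three valid days: Wed 8, Fri 10, Wed 15
    results.append(start + offset)
    print(start.day_name(), '->', results[-1])
```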

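## II. Indexing and Attributes of Time Series
### 1. Slicing
#### The rules here are almost identical to those in Chapter 2

In [36]:
# Weekly frequency: one timestamp per week (anchored on Sunday by default)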
rng = pd.date_range('2020', '2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.head()

2020-01-05   -0.068478
2020-01-12    0.061298
2020-01-19    0.277774
2020-01-26   -0.351321
2020-02-02   -1.842166
Freq: W-SUN, dtype: float64

#### Valid date strings are converted to time points automatically

In [43]:
ts['2020-01-26':'20200726'].head()

2020-01-26   -0.351321
2020-02-02   -1.842166
2020-02-09    0.895218
2020-02-16   -1.404452
2020-02-23    2.476737
Freq: W-SUN, dtype: float64

### 2. Subset indexing

In [44]:
# Partial-string indexing: '2020-7' selects every entry in July 2020
ts['2020-7'].head()

2020-07-05    1.452022
2020-07-12    0.604092
2020-07-19   -1.277922
2020-07-26   -1.273663
Freq: W-SUN, dtype: float64

#### Endpoints in different string formats can be mixed, and may lie outside the index range

In [33]:
ts['2011-1':'20200726'].head()

2020-01-05   -0.275349
2020-01-12    2.359218
2020-01-19   -0.447633
2020-01-26   -0.479830
2020-02-02    0.517587
Freq: W-SUN, dtype: float64
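
Because the series above holds random values, the following sketch repeats the two kinds of lookup on a deterministic series (the name `demo` is introduced here for illustration) so that the selections are easy to verify.

```python
import pandas as pd

rng = pd.date_range('2020', '2021', freq='W')
demo = pd.Series(range(len(rng)), index=rng)

july = demo.loc['2020-07']                    # every Sunday in July 2020
window = demo.loc['2020-01-26':'20200726']    # both endpoints are included

print(len(july))
print(window.index[0], window.index[-1])
```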

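### 3. Attributes of time points

#### The dt accessor gives easy access to date and time components

In [67]:
# .dt.week is deprecated and was removed in pandas 2.0; newer versions use .dt.isocalendar().week
pd.Series(ts.index).dt.week.head()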

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [68]:
pd.Series(ts.index).dt.day.head()

0     5
1    12
2    19
3    26
4     2
dtype: int64
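
On pandas versions where `.dt.week` is no longer available, the ISO week number can be read from `isocalendar()`. A small sketch on the first three Sundays of 2020:

```python
import pandas as pd

sundays = pd.Series(pd.date_range('2020-01-05', periods=3, freq='W'))

# isocalendar() returns a DataFrame with year, week and day columns
iso = sundays.dt.isocalendar()
print(iso['week'].tolist())
print(sundays.dt.day_name().tolist())
```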

#### strftime reformats the timestamps as strings

In [36]:
# Convert the timestamps to strings in the given format
pd.Series(ts.index).dt.strftime('%Y-sep1-%m-sep2-%d').head()

0    2020-sep1-01-sep2-05
1    2020-sep1-01-sep2-12
2    2020-sep1-01-sep2-19
3    2020-sep1-01-sep2-26
4    2020-sep1-02-sep2-02
dtype: object

#### For a DatetimeIndex, the same information is available directly as attributes

In [37]:
pd.date_range('2020', '2021', freq='W').month

Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')

In [38]:
pd.date_range('2020', '2021', freq='W').weekday

Int64Index([6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
            6, 6, 6, 6, 6, 6, 6, 6],
           dtype='int64')

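## III. Resampling (Key Topic)

#### Resampling refers to the resample function, which can <font color ="red">be viewed as the time-series version of groupby</font>

### 1. Basic operations on resample objects
#### The sampling frequency is usually one of the offset aliases listed above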

In [31]:
df_r = pd.DataFrame(np.random.randn(1000, 3), index=pd.date_range('1/1/2020', freq='S', periods=1000),
                    columns=['A', 'B', 'C'])
df_r.head()

Unnamed: 0,A,B,C
2020-01-01 00:00:00,-0.2977,-1.008183,0.437384
2020-01-01 00:00:01,1.92813,-0.740173,0.300483
2020-01-01 00:00:02,0.419239,0.863555,-0.355854
2020-01-01 00:00:03,0.262151,-0.928389,0.119726
2020-01-01 00:00:04,-1.355956,-0.938609,2.689538


In [33]:
# Resample into 3-minute bins (the first argument, rule, sets the bin frequency), then sum each bin
r = df_r.resample('3min')
r

<pandas.core.resample.DatetimeIndexResampler object at 0x7f0cec5ef1f0>

In [34]:
r.sum()

Unnamed: 0,A,B,C
2020-01-01 00:00:00,20.976761,18.35533,18.293814
2020-01-01 00:03:00,15.166632,-8.238867,-7.472075
2020-01-01 00:06:00,0.444612,-5.218266,-23.624539
2020-01-01 00:09:00,6.867767,2.259071,-10.915797
2020-01-01 00:12:00,-11.155568,-3.32476,19.881526
2020-01-01 00:15:00,-23.451414,-2.085044,7.967409
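
The analogy with groupby can be made concrete. In this sketch, on a small made-up series, `resample('3min')` and `groupby(pd.Grouper(freq='3min'))` put the rows into the same bins.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=10, freq='min')
demo = pd.Series(range(10), index=idx)

by_resample = demo.resample('3min').sum()
by_groupby = demo.groupby(pd.Grouper(freq='3min')).sum()

print(by_resample)
print(by_groupby)
```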


In [35]:
df_r2 = pd.DataFrame(np.random.randn(200, 3), index=pd.date_range('1/1/2020', freq='D', periods=200),
                     columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
r.sum()

Unnamed: 0,A,B,C
2020-01-01,-4.499955,-3.405157,4.714295
2020-02-03,0.979204,2.537624,11.646748
2020-03-02,2.830045,-0.447905,8.438597
2020-04-01,8.203017,6.060698,2.213382
2020-05-01,-0.890577,-0.970423,-1.21821
2020-06-01,3.557233,-2.849328,-2.387477
2020-07-01,-3.864199,1.574663,1.142592


### 2. Aggregation

In [43]:
# The outputs below were produced with the monthly (CBMS) resampler of df_r2
r = df_r2.resample('CBMS')

In [86]:
r['A'].mean()

2020-01-01   -0.017307
2020-02-03   -0.169221
2020-03-02   -0.096503
2020-04-01    0.020518
2020-05-01    0.068262
2020-06-01    0.218214
2020-07-01    0.297159
Freq: CBMS, Name: A, dtype: float64

In [87]:
r['A'].agg([np.sum, np.mean, np.std])

Unnamed: 0,sum,mean,std
2020-01-01,-0.571116,-0.017307,0.94053
2020-02-03,-4.738188,-0.169221,0.902684
2020-03-02,-2.895099,-0.096503,1.222134
2020-04-01,0.615551,0.020518,0.991366
2020-05-01,2.116136,0.068262,0.979724
2020-06-01,6.546415,0.218214,0.912452
2020-07-01,5.348862,0.297159,1.24204


In [94]:
r.agg({"A": [("mean_A", "mean"), ("A_std", "std")], "B": ["mean", "std"]})

Unnamed: 0_level_0,A,A,B,B
Unnamed: 0_level_1,mean_A,A_std,mean,std
2020-01-01,-0.017307,0.94053,-0.060468,0.928486
2020-02-03,-0.169221,0.902684,0.030897,1.18289
2020-03-02,-0.096503,1.222134,0.091393,1.005896
2020-04-01,0.020518,0.991366,-0.113549,0.976203
2020-05-01,0.068262,0.979724,0.170009,0.981872
2020-06-01,0.218214,0.912452,0.08056,1.065288
2020-07-01,0.297159,1.24204,-0.027014,0.824399


#### Functions and lambda expressions can be used in the same way

In [96]:
# A different aggregation function for each column
r.agg({'A': np.sum, 'B': lambda x: max(x)-min(x)})

Unnamed: 0,A,B
2020-01-01,-0.571116,3.890997
2020-02-03,-4.738188,5.389076
2020-03-02,-2.895099,4.375351
2020-04-01,0.615551,3.717044
2020-05-01,2.116136,4.170828
2020-06-01,6.546415,4.23104
2020-07-01,5.348862,3.340413


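### 3. Iterating over resample groups
#### Iterating over resample groups works just like iterating over groupby groups: each group can be processed separately

In [99]:
# By default the bins are anchored at the start of each hour
small = pd.Series(range(6), index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00',
                                                  '2020-01-01 00:31:00', '2020-01-01 01:00:00',
                                                  '2020-01-01 03:00:00', '2020-01-01 03:05:00']))
resampled = small.resample('H')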
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64



## IV. Window Functions

#### This part introduces the two main kinds of window functions in pandas: rolling and expanding

In [46]:
s = pd.Series(np.random.randn(1000),
              index=pd.date_range('1/1/2020', periods=1000))
s.head()

2020-01-01   -2.160463
2020-01-02    0.453226
2020-01-03   -1.126463
2020-01-04    0.426290
2020-01-05   -0.856071
Freq: D, dtype: float64

### 1. Rolling
#### (a) Common aggregations
#### rolling defines a sliding window; like a groupby object, it computes nothing by itself and must be combined with an aggregation function

In [38]:
s.rolling(window=50)

Rolling [window=50,center=False,axis=0]

In [39]:
# Sliding window: a value is produced only once the window holds the required number of observations
s.rolling(window=50).mean()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22   -0.144998
2022-09-23   -0.148015
2022-09-24   -0.125213
2022-09-25   -0.156017
2022-09-26   -0.169323
Freq: D, Length: 1000, dtype: float64

In [40]:
s.rolling(window=50, min_periods=1).count()

2020-01-01     1.0
2020-01-02     2.0
2020-01-03     3.0
2020-01-04     4.0
2020-01-05     5.0
              ... 
2022-09-22    50.0
2022-09-23    50.0
2022-09-24    50.0
2022-09-25    50.0
2022-09-26    50.0
Freq: D, Length: 1000, dtype: float64

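#### min_periods is the minimum number of non-missing observations a window must contain
A window with fewer non-missing values than this threshold yields NaN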

In [45]:
s[:3].mean()

-0.2895255323020124

In [41]:
s.rolling(window=50, min_periods=3).mean().head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -0.289526
2020-01-04   -0.117887
2020-01-05    0.087791
Freq: D, dtype: float64
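
The role of missing values is easiest to see on a tiny made-up series that contains a NaN:

```python
import numpy as np
import pandas as pd

demo = pd.Series([1.0, np.nan, 3.0, 4.0])

# Each window spans 3 rows but needs at least 2 non-missing values
result = demo.rolling(window=3, min_periods=2).mean()
print(result)
```

The first two windows contain only one non-missing value and therefore give NaN; the third window holds 1 and 3 and averages them while skipping the NaN.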

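#### count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr are the commonly used aggregations
#### (b) Aggregating with rolling.apply
#### With apply, just remember that the function receives a Series holding the current window (at most window elements) and must return a scalar; for example, the coefficient of variation below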

In [52]:
s.rolling(window=50, min_periods=3).apply(lambda x: x.std()/x.mean()).head()

2020-01-01          NaN
2020-01-02          NaN
2020-01-03   -10.018809
2020-01-04    -2.040720
2020-01-05    -1.463460
Freq: D, dtype: float64

#### Inspecting the windows

In [152]:
# Until the window is full, each successive window simply grows by one element
for i, val in enumerate(s.rolling(window=50, min_periods=3)):
    print(val)
    print("--"*20)
    if i == 4:
        break

2020-01-01    0.980473
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
2020-01-04    1.738455
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
2020-01-04    1.738455
2020-01-05    0.341967
Freq: D, dtype: float64
----------------------------------------


#### (c) Time-based rolling

In [153]:
s.rolling('15D').mean().head()

2020-01-01    0.980473
2020-01-02    1.038238
2020-01-03    1.761357
2020-01-04    1.755631
2020-01-05    1.472898
Freq: D, dtype: float64

#### The optional closed parameter, one of 'right' (default), 'left', 'both' or 'neither', decides which window endpoints are included

In [154]:
s.rolling('15D', closed='right').sum().head()

2020-01-01    0.980473
2020-01-02    2.076476
2020-01-03    5.284071
2020-01-04    7.022525
2020-01-05    7.364492
Freq: D, dtype: float64
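
With a 15-day window and only five rows shown, the effect of `closed` is invisible. The sketch below uses a shorter, made-up 2-day window so that the difference between two of the options shows up.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=4, freq='D')
demo = pd.Series([1, 2, 3, 4], index=idx)

# A '2D' window ending at time t covers (t - 2 days, t] by default
right = demo.rolling('2D', closed='right').sum()
# closed='both' also includes the observation exactly 2 days earlier
both = demo.rolling('2D', closed='both').sum()

print(pd.DataFrame({'right': right, 'both': both}))
```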

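### 2. Expanding

#### (a) The expanding function
#### A plain expanding is equivalent to rolling(window=len(s), min_periods=1): a cumulative computation in which each window covers every element up to the current one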

In [155]:
s.rolling(window=len(s), min_periods=1).sum().head()

2020-01-01    0.980473
2020-01-02    2.076476
2020-01-03    5.284071
2020-01-04    7.022525
2020-01-05    7.364492
Freq: D, dtype: float64

In [156]:
# Running total: each window holds all values up to and including the current row
s.expanding().sum().head()

2020-01-01    0.980473
2020-01-02    2.076476
2020-01-03    5.284071
2020-01-04    7.022525
2020-01-05    7.364492
Freq: D, dtype: float64

#### apply works here as well

In [157]:
s.expanding().apply(lambda x: sum(x)).head()

2020-01-01    0.980473
2020-01-02    2.076476
2020-01-03    5.284071
2020-01-04    7.022525
2020-01-05    7.364492
Freq: D, dtype: float64

#### Inspecting the windows

In [150]:
for i, val in enumerate(s.expanding()):
    print(val)
    print("--"*20)
    if i == 4:
        break

2020-01-01    0.980473
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
2020-01-04    1.738455
Freq: D, dtype: float64
----------------------------------------
2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
2020-01-04    1.738455
2020-01-05    0.341967
Freq: D, dtype: float64
----------------------------------------


#### (b) Some special expanding-type functions
#### cumsum/cumprod/cummax/cummin are special cumulative (expanding) methods

In [158]:
s.cumsum().head()

2020-01-01    0.980473
2020-01-02    2.076476
2020-01-03    5.284071
2020-01-04    7.022525
2020-01-05    7.364492
Freq: D, dtype: float64

In [160]:
s.cumprod().head()

2020-01-01    0.980473
2020-01-02    1.074601
2020-01-03    3.446884
2020-01-04    5.992252
2020-01-05    2.049151
Freq: D, dtype: float64
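
The relationship between the cumulative methods, expanding and a full-length rolling window can be checked numerically. The values below are arbitrary.

```python
import numpy as np
import pandas as pd

demo = pd.Series([3.0, -1.0, 4.0, 1.0, -5.0])

same_sum = np.allclose(demo.cumsum(), demo.expanding().sum())
same_max = np.allclose(demo.cummax(), demo.expanding().max())
same_roll = np.allclose(demo.expanding().sum(),
                        demo.rolling(window=len(demo), min_periods=1).sum())
print(same_sum, same_max, same_roll)
```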

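#### shift/diff/pct_change all relate an element to its neighbors
#### ① shift <font color="red">keeps the index unchanged and moves the values backward (toward later positions)</font>
#### ② diff is the difference between an element and an earlier one; the periods parameter sets the lag, defaults to 1, and may be negative
#### ③ pct_change is the relative change between an element and an earlier one (a fraction, despite the name); its periods parameter works like that of diff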

In [170]:
s.head()

2020-01-01    0.980473
2020-01-02    1.096003
2020-01-03    3.207595
2020-01-04    1.738455
2020-01-05    0.341967
Freq: D, dtype: float64

In [166]:
# Positive shift: values move toward later dates
s.shift(2).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03    0.980473
2020-01-04    1.096003
2020-01-05    3.207595
Freq: D, dtype: float64

In [168]:
# Negative shift: values move toward earlier dates
s.shift(-2).head()

2020-01-01    3.207595
2020-01-02    1.738455
2020-01-03    0.341967
2020-01-04    0.496723
2020-01-05   -0.845511
Freq: D, dtype: float64

In [162]:
# Each value minus the value 3 rows earlier, i.e. s - s.shift(3)
s.diff(3).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04    0.757982
2020-01-05   -0.754036
Freq: D, dtype: float64

In [47]:
s.diff(1)

2020-01-01         NaN
2020-01-02    2.613690
2020-01-03   -1.579689
2020-01-04    1.552752
2020-01-05   -1.282360
                ...   
2022-09-22   -0.562051
2022-09-23    0.295476
2022-09-24    1.115714
2022-09-25   -0.823282
2022-09-26    0.612999
Freq: D, Length: 1000, dtype: float64

In [50]:
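# Note: s - s.shift(-1) equals s.diff(-1), not s.diff(1) (which is s - s.shift(1)); fillna(0) only changes the last element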
(s - s.shift(-1).fillna(0)).head()

2020-01-01   -2.613690
2020-01-02    1.579689
2020-01-03   -1.552752
2020-01-04    1.282360
2020-01-05    0.002429
Freq: D, dtype: float64

In [52]:
s.pct_change(3).head()

2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04   -1.197314
2020-01-05   -2.888837
Freq: D, dtype: float64
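
The three methods are tied together by simple identities, which the following sketch checks on a small made-up series (`k` is an arbitrary lag):

```python
import numpy as np
import pandas as pd

demo = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])
k = 2

# diff(k) is the value minus the value k rows earlier
diff_ok = np.allclose(demo.diff(k), demo - demo.shift(k), equal_nan=True)
# pct_change(k) is the ratio to the value k rows earlier, minus 1
pct_ok = np.allclose(demo.pct_change(k), demo / demo.shift(k) - 1, equal_nan=True)
# A negative lag looks ahead instead: diff(-1) is s - s.shift(-1)
neg_ok = np.allclose(demo.diff(-1), demo - demo.shift(-1), equal_nan=True)
print(diff_ok, pct_ok, neg_ok)
```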

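## V. Questions and Exercises

#### [Question 1] How can frames be added to a date_range in bulk, i.e. how can the timestamp density of a given time span be increased?
Answer: Use date_range(start, end, periods=...): keep the time span fixed and increase the periods argument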

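#### [Question 2] How can the precision of a Timestamp be increased in bulk?
Answer: pd.to_datetime('2020/1/1 00:00:00.00') accepts up to 9 digits after the decimal point, i.e. nanosecond precision.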

#### [Question 3] Is there really no way to handle time points outside the supported range?
Answer: To be added
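
One possible approach, offered here as a suggestion rather than as the author's answer, is to represent such dates as Period objects. Periods are stored at their own frequency, so daily periods are not bound by the nanosecond range of Timestamp. The date range below is an arbitrary historical example.

```python
import pandas as pd

# Daily periods can reach far outside the Timestamp range
old = pd.period_range('1215-01-01', '1381-01-01', freq='D')
print(old[0], old[-1], len(old))

# The first period lies well before the earliest representable Timestamp
print(old[0].year, pd.Timestamp.min.year)
```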

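#### [Question 4] Given a set of non-consecutive dates, how can the dates that lie between its minimum and maximum but are absent from the set be found quickly?
Answer: Sort the dates, then compare them with the complete daily range between the minimum and maximum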
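
A minimal sketch of that comparison, assuming daily granularity and using made-up dates: build the full range with date_range and take the set difference.

```python
import pandas as pd

dates = pd.to_datetime(['2020-01-05', '2020-01-01', '2020-01-03', '2020-01-02'])

# Every calendar day between the smallest and largest date ...
full = pd.date_range(dates.min(), dates.max(), freq='D')
# ... minus the days that are actually present
missing = full.difference(dates)
print(missing)
```

Here no explicit sort is needed, because min() and max() find the endpoints directly.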

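#### [Exercise 1] Given time series data on the milk sales of a supermarket, answer the following questions (the columns 日期 and 销售额 hold the date and the sales amount):

In [137]:
# parse_dates converts the 日期 (date) column to datetime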
df = pd.read_csv('data/time_series_one.csv', parse_dates=["日期"])
df.head()

Unnamed: 0,日期,销售额
0,2017-02-17,2154
1,2017-02-18,2095
2,2017-02-19,3459
3,2017-02-20,2198
4,2017-02-21,2413


In [138]:
df.tail()

Unnamed: 0,日期,销售额
995,2019-11-09,3022
996,2019-11-10,2961
997,2019-11-11,3984
998,2019-11-12,2799
999,2019-11-13,2941


In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   日期      1000 non-null   datetime64[ns]
 1   销售额     1000 non-null   int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 15.8 KB


#### (a) On which day of the week did the maximum sales occur? (Hint: use the dayofweek attribute)

In [140]:
# Day of the week for every date (Monday=0, Sunday=6)
df["日期"].dt.dayofweek

0      4
1      5
2      6
3      0
4      1
      ..
995    5
996    6
997    0
998    1
999    2
Name: 日期, Length: 1000, dtype: int64

In [141]:
df["销售额"].idxmax()

926

In [142]:
df["日期"].dt.dayofweek[df["销售额"].idxmax()]

6

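#### (b) Compute the monthly sales totals excluding the Spring Festival, National Day and Labor Day holidays

In [143]:
# Build the list of holiday dates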
holiday = pd.date_range(start='20170501', end='20170503').append(
    pd.date_range(start='20171001', end='20171007')).append(
    pd.date_range(start='20180215', end='20180221')).append(
    pd.date_range(start='20180501', end='20180503')).append(
    pd.date_range(start='20181001', end='20181007')).append(
    pd.date_range(start='20190204', end='20190224')).append(
    pd.date_range(start='20190501', end='20190503')).append(
    pd.date_range(start='20191001', end='20191007'))

In [144]:
df1 = df[~df["日期"].isin(holiday)].set_index("日期")
df1.head()

Unnamed: 0_level_0,销售额
日期,Unnamed: 1_level_1
2017-02-17,2154
2017-02-18,2095
2017-02-19,3459
2017-02-20,2198
2017-02-21,2413


In [145]:
df1.resample("MS").agg({"销售额": [("总销售额", "sum")]})

Unnamed: 0_level_0,销售额
Unnamed: 0_level_1,总销售额
日期,Unnamed: 1_level_2
2017-02-01,31740
2017-03-01,80000
2017-04-01,74734
2017-05-01,76237
2017-06-01,80750
...,...
2019-07-01,90902
2019-08-01,93664
2019-09-01,89077
2019-10-01,72099


#### (c) Compute the quarterly totals of weekend (Saturday and Sunday) sales

In [147]:
# QS groups by quarter, labeled with the quarter start
df[df["日期"].dt.dayofweek.isin([5, 6])].set_index("日期").resample("QS").sum()

Unnamed: 0_level_0,销售额
日期,Unnamed: 1_level_1
2017-01-01,32894
2017-04-01,66692
2017-07-01,69099
2017-10-01,70384
2018-01-01,74671
...,...
2018-10-01,74699
2019-01-01,77835
2019-04-01,77042
2019-07-01,76276


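#### (d) Starting from the last day, skip Saturdays and Mondays, and compute sales totals backward in units of 5 days

#### (e) Suppose the data turn out to be wrong: within every week, the Monday and Friday sales records were swapped. Compute the sales on the first Monday of each month in 2018 (if a week lacks a Monday or a Friday record, leave it unchanged)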

In [150]:
df_temp = df.copy()

In [161]:
df

Unnamed: 0,日期,销售额
0,2017-02-17,2154
1,2017-02-18,2095
2,2017-02-19,3459
3,2017-02-20,2198
4,2017-02-21,2413
...,...,...
995,2019-11-09,3022
996,2019-11-10,2961
997,2019-11-11,3984
998,2019-11-12,2799


In [163]:
df.shift(4)[df.shift(4)["日期"].dt.dayofweek == 1]

Unnamed: 0,日期,销售额
8,2017-02-21,2413.0
15,2017-02-28,2167.0
22,2017-03-07,2091.0
29,2017-03-14,2358.0
36,2017-03-21,2219.0
...,...,...
967,2019-10-08,3272.0
974,2019-10-15,3421.0
981,2019-10-22,2946.0
988,2019-10-29,2601.0


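#### [Exercise 2] Using the same data as in the previous exercise, answer the following questions:

#### (a) Compute the rolling mean and rolling maximum with a 50-day window (min_periods set to 1)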

In [181]:
df = pd.read_csv("./data/time_series_one.csv",
                 index_col="日期", parse_dates=["日期"])
# Select the 销售额 Series, whose index is the DatetimeIndex parsed from 日期
df["销售额"].rolling(window=50, min_periods=1).max().head()

日期
2017-02-17    2154.0
2017-02-18    2154.0
2017-02-19    3459.0
2017-02-20    3459.0
2017-02-21    3459.0
Name: 销售额, dtype: float64

In [182]:
df["销售额"].rolling(window=50, min_periods=1).mean().head()

日期
2017-02-17    2154.000000
2017-02-18    2124.500000
2017-02-19    2569.333333
2017-02-20    2476.500000
2017-02-21    2463.800000
Name: 销售额, dtype: float64

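#### (b) Apply the following rule: a day is marked 1 if its sales exceed the mean of the preceding 5 days, and 0 otherwise. Give the results for 2018

In [186]:
# Approach: use a 6-row window (the current day plus the preceding 5), starting on 2017-12-27 so that 2018-01-01 already has a full window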
df_rolling = df.loc[pd.date_range("20171227", "20181231"), :].rolling(
    window=6, min_periods=1)

In [188]:
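def f(x):
    # Work on a plain array so that positional indexing is unambiguous
    x = np.asarray(x)
    if len(x) == 6:
        # Compare the current day with the mean of the preceding 5 days
        return 1 if x[-1] > x[:-1].mean() else 0
    else:
        return 0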

In [194]:
df_res = df_rolling.agg(f)[5:]
df_res.head()

Unnamed: 0,销售额
2018-01-01,1.0
2018-01-02,0.0
2018-01-03,0.0
2018-01-04,0.0
2018-01-05,0.0
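
The same rule can also be written without a custom function. The sketch below is an alternative formulation, not the solution used above, and it runs on made-up numbers because the CSV file is not included here: shifting by one row before taking a 5-row rolling mean gives the mean of the preceding 5 days.

```python
import pandas as pd

idx = pd.date_range('2017-12-27', periods=10, freq='D')
sales = pd.Series([5, 3, 8, 6, 7, 9, 2, 6, 10, 4], index=idx, dtype=float)

# Mean of the 5 preceding days: shift by one row, then take a 5-row rolling mean
prev_mean = sales.shift(1).rolling(window=5).mean()
flag = (sales > prev_mean).astype(int)

# Keep only the rows that have a full 5-day history
print(flag[prev_mean.notna()])
```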


#### (c) In (b), replace "the preceding 5 days" with "the preceding 5 non-weekend days" and compute the results again