# Time series

+ 时间序列是非常重要的数据形式。在一系列时间点上观察到的数据序列就是时间序列数据。
+ 通常的时间序列数据根据数据采样的时间间隔可以分为固定频率或者非固定频率两类。

+ 我们分析的很多时间序列数据大部分是固定频率的序列，
  - 比如每季度、每年按期公布的宏观经济数据（GDP，CPI等）。
  - 当然这里的固定频率也并非一定是绝对的等时间间隔数据，比如我们通常说的日收益率，虽然都间隔一个交易日，但因为假期等因素的影响，数据发生的绝对时间并不是等间隔的。
+ 不等时间间隔的数据，如果我们将每一笔成交的数据都记录下来，得到的数据一般不是等时间间隔的。

**三种时间类型**
+ 时间戳 timestamp, 特定时间
+ 固定时期，period，如2007年1月或2010年全年
+ 时间间隔 interval

+ Pandas 提供了一组标准的时间序列处理工具和数据算法，这些工具在处理金融和经济数据非常有用
+ 和前面章节一样，我们首先导入一些包和函数

In [44]:
from pandas import Series, DataFrame
import pandas as pd
from numpy.random import randn
import numpy as np
pd.options.display.max_rows = 12
np.set_printoptions(precision=4, suppress=True)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(12, 4))

+ 查看当前工作目录

In [45]:
%pwd

'c:\\Users\\16017\\Desktop\\Dropbox\\金融数据分析导引\\Guide_to_financial_data_analysis\\Lectures\\Lecture5'

+ 下面命令使得图形能出现在notebook页面中

In [46]:
%matplotlib inline
#%matplotlib qt

## 日期和时间数据类型及工具
+ datetime模块包中提供了处理日期时间数据的工具, 以毫秒储存日期和时间，


In [47]:
from datetime import datetime
now = datetime.now()
now#到毫秒

datetime.datetime(2023, 10, 10, 16, 13, 17, 178352)

+ 下面分别给出了 年、月、日、时、分、秒、微秒(百万分之一秒)

In [48]:
now.year, now.month, now.day,now.hour,now.minute,now.second,now.microsecond 

(2023, 10, 10, 16, 13, 17, 178352)

+ 我们可以直接对时间进行减法运算，得到两者的时间差

In [49]:
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

datetime.timedelta(days=926, seconds=56700)

In [50]:
delta.days

926

In [51]:
delta.seconds

56700

+ 也可以直接构造时间差数据类型,得到给定间隔的时间,timedelta表示两个datetime对象的时间差。时间和时间差之间可以进行加减运算。

In [52]:
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(days=120)

datetime.datetime(2011, 5, 7, 0, 0)

In [53]:
start - 2 * timedelta(days=12)

datetime.datetime(2010, 12, 14, 0, 0)

### 字符串和datetime的相互转换

+ str或者strftime方法可以将日期时间转换为字符串


In [54]:
stamp = datetime(2011, 1, 3,6)

In [55]:
str(stamp)

'2011-01-03 06:00:00'

In [56]:
stamp.strftime('%Y%m%d')
# %Y:四位的年，$y两位的年，$m月，$d日

'20110103'

+ strptime可以将已知格式的日期时间字符串解析为时间类型 

In [57]:
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d')#要给出格式

datetime.datetime(2011, 1, 3, 0, 0)

In [58]:
datetime.strptime("2001-01-10 10:10:10.5", "%Y-%m-%d %I:%M:%S.%f")
# %M 分，%S 秒

datetime.datetime(2001, 1, 10, 10, 10, 10, 500000)

In [59]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]

datetime格式：

格式 | 含义
------ | ------
%Y | 四位年
%y |两位年
%m |2位月
%d |两位日
%H |时 24小时制
%I |时 12小时制
%M |分钟
%S |秒
%w |星期几,0表示周日
%W |每年的第几周 [0，53]，每年第一个星期一之前的那周为第0周
%F |%Y-%m-%d 简写 2014-2-1
%D |%m/%d/%y 简写 04/16/12


+ pandas中的to_datetime 方法也可以解析成组日期。

In [60]:
datestrs

['7/6/2011', '8/6/2011']

In [61]:
pd.to_datetime(datestrs)
# note: output changed (no '00:00:00' anymore)

DatetimeIndex(['2011-07-06', '2011-08-06'], dtype='datetime64[ns]', freq=None)

## Time Series Basics
Pandas 中最基本是时间序列类型是以时间为索引的Series

In [62]:
from datetime import datetime
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = Series(np.random.randn(6), index=dates)#dates as the index of the new Series
ts

2011-01-02    0.730945
2011-01-05   -1.413019
2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

+ 不同索引时间序列之间的运算会自动按时间对齐。

In [63]:
ts + ts[::2]
##：：2表示每取0，2，4


2011-01-02    1.461890
2011-01-05         NaN
2011-01-07   -1.970182
2011-01-08         NaN
2011-01-10    1.931536
2011-01-12         NaN
dtype: float64

### Generating date ranges

In [64]:
index = pd.date_range('4/1/2012', '6/1/2012')
index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
      

In [65]:
pd.date_range(start='4/1/2012', periods=20)

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

In [66]:
pd.date_range?

[1;31mSignature:[0m
[0mpd[0m[1;33m.[0m[0mdate_range[0m[1;33m([0m[1;33m
[0m    [0mstart[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mend[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mperiods[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mfreq[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mtz[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mnormalize[0m[1;33m:[0m [1;34m'bool'[0m [1;33m=[0m [1;32mFalse[0m[1;33m,[0m[1;33m
[0m    [0mname[0m[1;33m:[0m [1;34m'Hashable'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mclosed[0m[1;33m:[0m [1;34m"Literal['left', 'right'] | None | lib.NoDefault"[0m [1;33m=[0m [1;33m<[0m[0mno_default[0m[1;33m>[0m[1;33m,[0m[1;33m
[0m    [0minclusive[0m[1;33m:[0m [1;34m'IntervalClosedType | None'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [1;33m**[0m[0mkwargs[0m[1;33m,[0m[1;33m
[0m[1;33m)[0m [1;33

In [67]:
pd.date_range(end='6/1/2012', periods=20)

DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

In [68]:
pd.date_range('1/1/2015', '12/1/2015', freq='BM')
# freq选项很多，上面表示每月的最后一个工作日

DatetimeIndex(['2015-01-30', '2015-02-27', '2015-03-31', '2015-04-30',
               '2015-05-29', '2015-06-30', '2015-07-31', '2015-08-31',
               '2015-09-30', '2015-10-30', '2015-11-30'],
              dtype='datetime64[ns]', freq='BM')

In [69]:
pd.date_range('1/1/2015', '12/1/2015', freq="2H")

DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 02:00:00',
               '2015-01-01 04:00:00', '2015-01-01 06:00:00',
               '2015-01-01 08:00:00', '2015-01-01 10:00:00',
               '2015-01-01 12:00:00', '2015-01-01 14:00:00',
               '2015-01-01 16:00:00', '2015-01-01 18:00:00',
               ...
               '2015-11-30 06:00:00', '2015-11-30 08:00:00',
               '2015-11-30 10:00:00', '2015-11-30 12:00:00',
               '2015-11-30 14:00:00', '2015-11-30 16:00:00',
               '2015-11-30 18:00:00', '2015-11-30 20:00:00',
               '2015-11-30 22:00:00', '2015-12-01 00:00:00'],
              dtype='datetime64[ns]', length=4009, freq='2H')

In [70]:
pd.date_range('5/2/2012 12:56:31', periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

In [71]:
pd.date_range('5/2/2012 12:56:31', periods=5, normalize=True)
#标准化

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

### 索引、选取、子集构造

 
+ 可以通过传入年，月，日进行选择Series的子集，非常方便。
+ 甚至可以非常直观的转入时间序列字符串设置选取的时间范围。

In [73]:
ts

2011-01-02    0.730945
2011-01-05   -1.413019
2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

In [72]:
stamp = ts.index[2]
print(stamp)
ts[stamp]


2011-01-07 00:00:00


-0.9850910765894495

In [74]:
ts['1/10/2011']#也可以认出来，可以理解为是先把字符串改变为标准的时间格式，再去寻找的

0.9657677692569688

In [75]:
ts['20110110']

0.9657677692569688

In [77]:
longer_ts = Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01   -0.538501
2000-01-02   -0.867951
2000-01-03   -0.886711
2000-01-04   -0.409106
2000-01-05   -0.658953
                ...   
2002-09-22    0.105938
2002-09-23    1.182932
2002-09-24   -0.499720
2002-09-25    1.215944
2002-09-26    1.617547
Freq: D, Length: 1000, dtype: float64

In [78]:
longer_ts['2001']#取出所有的2001年的数据

2001-01-01   -1.032675
2001-01-02    0.498920
2001-01-03    2.343687
2001-01-04    0.292963
2001-01-05    0.835889
                ...   
2001-12-27   -0.342962
2001-12-28   -0.358853
2001-12-29    0.928742
2001-12-30   -0.595095
2001-12-31    0.694819
Freq: D, Length: 365, dtype: float64

In [79]:
longer_ts['2001-05']#取出所有2001-05的所有数据

2001-05-01   -0.592862
2001-05-02    0.008138
2001-05-03    1.264762
2001-05-04    0.280549
2001-05-05   -1.584563
                ...   
2001-05-27   -0.691740
2001-05-28   -0.427595
2001-05-29   -1.316458
2001-05-30   -1.304623
2001-05-31    0.388473
Freq: D, Length: 31, dtype: float64

In [80]:
ts

2011-01-02    0.730945
2011-01-05   -1.413019
2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

In [81]:
ts[datetime(2011, 1, 7):]#取出2001-01-07及其之后的数据

2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

In [82]:
ts.truncate(after='1/9/2011')#使用面向对象的形式也行

2011-01-02    0.730945
2011-01-05   -1.413019
2011-01-07   -0.985091
2011-01-08   -0.115496
dtype: float64

In [83]:
ts['1/7/2011':'1/12/2011']#与之前的切片不同，包含了初始值和终止值

2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

In [91]:
dates = pd.date_range('10/10/2023', periods=100, freq='W-TUE')
long_df = DataFrame(np.random.randn(100, 4),
                    index=dates,
                    columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['2023-10']
# 修改时间，通过输出推测W-WED的含义

Unnamed: 0,Colorado,Texas,New York,Ohio
2023-10-10,-0.033112,-0.220518,1.327467,2.026256
2023-10-17,-1.997557,1.557152,-0.044465,1.292714
2023-10-24,1.761324,-0.095386,3.580562,0.254803
2023-10-31,1.389296,-0.91437,1.934397,0.16398


### Time series with duplicate indices***
有时候一个时间点上可能有多个观察，即索引可以有重复，这时选择得到的数据可能是一组数据。

In [93]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
                          '1/3/2000'])
dup_ts = Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

In [95]:
dup_ts.index.is_unique#判断index是不是unique的

False

In [96]:
dup_ts['1/3/2000']  # not duplicated

4

In [97]:
dup_ts['1/2/2000']  # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

In [100]:
grouped = dup_ts.groupby(level=0)#level=0 实际上只有一个Index, not multi-index
grouped.mean()
#level是怎么用的？

2000-01-01    0.0
2000-01-02    2.0
2000-01-03    4.0
dtype: float64

In [101]:
dup_ts.groupby?
# 看LEVEL的解释和例子

[1;31mSignature:[0m
[0mdup_ts[0m[1;33m.[0m[0mgroupby[0m[1;33m([0m[1;33m
[0m    [0mby[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0maxis[0m[1;33m:[0m [1;34m'Axis'[0m [1;33m=[0m [1;36m0[0m[1;33m,[0m[1;33m
[0m    [0mlevel[0m[1;33m:[0m [1;34m'Level'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mas_index[0m[1;33m:[0m [1;34m'bool'[0m [1;33m=[0m [1;32mTrue[0m[1;33m,[0m[1;33m
[0m    [0msort[0m[1;33m:[0m [1;34m'bool'[0m [1;33m=[0m [1;32mTrue[0m[1;33m,[0m[1;33m
[0m    [0mgroup_keys[0m[1;33m:[0m [1;34m'bool | lib.NoDefault'[0m [1;33m=[0m [1;33m<[0m[0mno_default[0m[1;33m>[0m[1;33m,[0m[1;33m
[0m    [0msqueeze[0m[1;33m:[0m [1;34m'bool | lib.NoDefault'[0m [1;33m=[0m [1;33m<[0m[0mno_default[0m[1;33m>[0m[1;33m,[0m[1;33m
[0m    [0mobserved[0m[1;33m:[0m [1;34m'bool'[0m [1;33m=[0m [1;32mFalse[0m[1;33m,[0m[1;33m
[0m    [0mdropna[0m[1;33m:[0m [1;34m'bool'[0m [1

In [102]:
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

In [103]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64

## Date ranges, Frequencies, and Shifting

+ 使用resample方法进行取样
+ pandas中生成一串时间的函数 date_range


In [104]:
ts

2011-01-02    0.730945
2011-01-05   -1.413019
2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-10    0.965768
2011-01-12   -0.505679
dtype: float64

In [105]:
ts.resample('D').sum()
#对于ts，按照日抽样，（会默认补全0），

2011-01-02    0.730945
2011-01-03    0.000000
2011-01-04    0.000000
2011-01-05   -1.413019
2011-01-06    0.000000
2011-01-07   -0.985091
2011-01-08   -0.115496
2011-01-09    0.000000
2011-01-10    0.965768
2011-01-11    0.000000
2011-01-12   -0.505679
Freq: D, dtype: float64

In [106]:
ts.resample('3D').sum()

2011-01-02    0.730945
2011-01-05   -2.398110
2011-01-08    0.850272
2011-01-11   -0.505679
Freq: 3D, dtype: float64

###  频率和日期偏移量

In [108]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour(2)
hour

<2 * Hours>

In [109]:
pd.date_range('1/1/2000', '1/3/2000 23:59', freq='4h')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

In [110]:
Hour(2) + Minute(30)

<150 * Minutes>

In [115]:
pd.date_range(start='1/1/2000', periods=10, freq=(Hour(1) + Minute(30)))

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

+ 下面的freq选项表示每月第三个星期五，WOM表示 week of month

In [119]:
rng = pd.date_range(start='9/1/2023', end='11/1/2023', freq='WOM-3Fri')#每月第三个星期五

list(rng)

[Timestamp('2023-09-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2023-10-20 00:00:00', freq='WOM-3FRI')]

### 数据时间的平移

将时间序列数据按时间平移，可以用来进行差分或者增长率计算。

In [120]:
ts = Series(np.random.randn(4),
            index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

2000-01-31    2.254607
2000-02-29   -0.204677
2000-03-31    0.980872
2000-04-30    0.658216
Freq: M, dtype: float64

In [121]:
ts.shift(1)#把value向下平移一位

2000-01-31         NaN
2000-02-29    2.254607
2000-03-31   -0.204677
2000-04-30    0.980872
Freq: M, dtype: float64

In [123]:
ts.shift(-2)#把value向上平移2位

2000-01-31    0.980872
2000-02-29    0.658216
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

In [124]:
ts/ts.shift(1)-1 #这个月/上个月 - 1

2000-01-31         NaN
2000-02-29   -1.090782
2000-03-31   -5.792283
2000-04-30   -0.328949
Freq: M, dtype: float64

In [127]:
ts.shift(3, freq='M')#移动time，而不是value

2000-04-30    2.254607
2000-05-31   -0.204677
2000-06-30    0.980872
2000-07-31    0.658216
Freq: M, dtype: float64

In [128]:
ts

2000-01-31    2.254607
2000-02-29   -0.204677
2000-03-31    0.980872
2000-04-30    0.658216
Freq: M, dtype: float64

In [129]:
ts.shift(3, freq='D')
# 相当于 freq=3D

2000-02-03    2.254607
2000-03-03   -0.204677
2000-04-03    0.980872
2000-05-03    0.658216
dtype: float64

In [130]:
ts.shift(1, freq='3D')

2000-02-03    2.254607
2000-03-03   -0.204677
2000-04-03    0.980872
2000-05-03    0.658216
dtype: float64

+ 移动90秒

In [131]:
ts.shift(1, freq='90T')

2000-01-31 01:30:00    2.254607
2000-02-29 01:30:00   -0.204677
2000-03-31 01:30:00    0.980872
2000-04-30 01:30:00    0.658216
dtype: float64

## 日期及其算术运算

+ A- 表示每年指定月份
+ Period 对象表示一段时间
+ 下列表示2006年12月1日到2007年11月30日之间的整段时间

In [132]:
p = pd.Period(2007, freq='A-NOV')
p

Period('2007', 'A-NOV')

In [133]:
p.start_time,p.end_time

(Timestamp('2006-12-01 00:00:00'), Timestamp('2007-11-30 23:59:59.999999999'))

Period对象加减一个整数可根据其频率实现位移效果 

In [134]:
p + 5#根据的是对象的频率

Period('2012', 'A-NOV')

In [135]:
p - 2

Period('2005', 'A-NOV')

+ 两个等频率的Period对象的差就是他们之间的单位数量

In [136]:
pd.Period('2014', freq='A-NOV')-p

<7 * YearEnds: month=11>

+ 表示 时期序列

In [139]:
rng = pd.period_range(start='1/1/2000', end='6/30/2000', freq='M')
rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]')

In [140]:
Series(np.random.randn(6), index=rng)

2000-01    0.516456
2000-02   -2.484818
2000-03    0.165181
2000-04   -0.653130
2000-05    0.736969
2000-06   -1.107977
Freq: M, dtype: float64

In [153]:
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')#季度频率, 樂，还得专门做一个PeriodIndex对象。。。
index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]')

In [166]:
Series(np.random.randn(3), index=['2001Q3', '2002Q2', '2003Q1'])

2001Q3   -1.588671
2002Q2   -1.193946
2003Q1   -2.452930
dtype: float64

In [167]:
Series(np.random.randn(3), index=index)

2001Q3   -2.019992
2002Q2    2.632031
2003Q1   -0.533102
Freq: Q-DEC, dtype: float64

### 时期频率转换

通过asfreq方法可以将时期或时期index转换为别的频率

In [172]:
p = pd.Period('2007', freq='A-DEC')
p

Period('2007', 'A-DEC')

In [173]:
p.asfreq('M', how='start')

Period('2007-01', 'M')

In [175]:
p#原来的p没有改变

Period('2007', 'A-DEC')

In [176]:
p.asfreq('M', how='end')

Period('2007-12', 'M')

In [177]:
p = pd.Period('2007', freq='A-JUN')
p.asfreq('M', 'start')

Period('2006-07', 'M')

In [178]:
p.asfreq('M', 'end')

Period('2007-06', 'M')

In [179]:
p = pd.Period('Aug-2007', 'M')
p.asfreq('A-JUN')

Period('2008', 'A-JUN')

In [180]:
rng = pd.period_range('2006', '2009', freq='A-DEC')
ts = Series(np.random.randn(len(rng)), index=rng)
ts

2006    0.580969
2007   -0.268509
2008    0.912624
2009    0.231478
Freq: A-DEC, dtype: float64

In [181]:
ts.asfreq('M', how='start')

2006-01    0.580969
2007-01   -0.268509
2008-01    0.912624
2009-01    0.231478
Freq: M, dtype: float64

In [182]:
ts.asfreq('B', how='end')

2006-12-29    0.580969
2007-12-31   -0.268509
2008-12-31    0.912624
2009-12-31    0.231478
Freq: B, dtype: float64

### 按季度计算的时期频率

+ 季度型数据在会计、金融等领域很常见。很多数据都会涉及财年末的概念，通常表示数据刻画的时期几月开始，几月结束。用Q-表示


In [183]:
p = pd.Period('2012Q4', freq='Q-JAN')#季度
p

Period('2012Q4', 'Q-JAN')

In [184]:
p.asfreq('D', 'start')

Period('2011-11-01', 'D')

In [185]:
p.asfreq('D', 'end')

Period('2012-01-31', 'D')

下面的例子获取 季度倒数第二个工作日下午4点的时间戳

In [187]:
p4pm = (p.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
p4pm

Period('2012-01-30 16:00', 'T')

In [189]:
p4pm.to_timestamp()

Timestamp('2012-01-30 16:00:00')

In [188]:
p4pm.to_timestamp?

[1;31mDocstring:[0m
Return the Timestamp representation of the Period.

Uses the target frequency specified at the part of the period specified
by `how`, which is either `Start` or `Finish`.

Parameters
----------
freq : str or DateOffset
    Target frequency. Default is 'D' if self.freq is week or
    longer and 'S' otherwise.
how : str, default 'S' (start)
    One of 'S', 'E'. Can be aliased as case insensitive
    'Start', 'Finish', 'Begin', 'End'.

Returns
-------
Timestamp
[1;31mType:[0m      builtin_function_or_method

下面的例子生成季度型范围

In [86]:
rng = pd.period_range('2011Q3', '2012Q4', freq='Q-JAN')
ts = Series(np.arange(len(rng)), index=rng)
ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

In [87]:
new_rng = (rng.asfreq('B', 'e') - 1).asfreq('T', 's') + 16 * 60
ts.index = new_rng.to_timestamp()
ts

2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

### 将 Timestamps 转换为 Periods 

+ to_period方法
+ to_timestamp 方法

In [88]:
rng = pd.date_range('1/1/2000', periods=3, freq='M')
ts = Series(randn(3), index=rng)
pts = ts.to_period()


In [89]:
print(ts)
pts

2000-01-31    0.198275
2000-02-29   -0.281490
2000-03-31    0.231296
Freq: M, dtype: float64


2000-01    0.198275
2000-02   -0.281490
2000-03    0.231296
Freq: M, dtype: float64

In [90]:
rng = pd.date_range('1/29/2000', periods=6, freq='D')
ts2 = Series(randn(6), index=rng)
ts2.to_period('M')

2000-01    1.098593
2000-01   -0.180997
2000-01    0.907144
2000-02   -1.147550
2000-02    0.763143
2000-02   -1.354875
Freq: M, dtype: float64

In [91]:
pts = ts.to_period()
pts

2000-01    0.198275
2000-02   -0.281490
2000-03    0.231296
Freq: M, dtype: float64

In [92]:
pts.to_timestamp(how='end')

2000-01-31 23:59:59.999999999    0.198275
2000-02-29 23:59:59.999999999   -0.281490
2000-03-31 23:59:59.999999999    0.231296
Freq: M, dtype: float64

**区别？**
+ period 表示一段时间，timestamp表示一个时间点
+ 可以利用的方法不一样
+ period可以用来检查该段时间内是否发生什么事件。比如飞机降落，股票平均价格

In [93]:
p = pd.Period('2019-10-07')
test = pd.Timestamp('2019-10-07 9:11')
p.start_time < test < p.end_time


True

In [94]:
p.start_time, p.end_time


(Timestamp('2019-10-07 00:00:00'), Timestamp('2019-10-07 23:59:59.999999999'))

### 通过数组创建 PeriodIndex

In [95]:
data = pd.read_csv('data/macrodata.csv')
data.year

0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
        ...  
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64

In [96]:
data.quarter

0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
      ... 
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64

In [97]:
index = pd.PeriodIndex(year=data.year, quarter=data.quarter, freq='Q-DEC')
index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')

In [98]:
data.index = index
data

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
1959Q1,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.980,139.7,2.82,5.8,177.146,0.00,0.00
1959Q2,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.150,141.7,3.08,5.1,177.830,2.34,0.74
1959Q3,1959.0,3.0,2775.488,1751.8,289.226,491.260,1916.4,29.350,140.5,3.82,5.3,178.657,2.74,1.09
1959Q4,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.370,140.0,4.33,5.6,179.386,0.27,4.06
1960Q1,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.540,139.6,3.50,5.2,180.007,2.31,1.19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2008Q3,2008.0,3.0,13324.600,9267.7,1990.693,991.551,9838.3,216.889,1474.7,1.17,6.0,305.270,-3.16,4.33
2008Q4,2008.0,4.0,13141.920,9195.3,1857.661,1007.273,9920.4,212.174,1576.5,0.12,6.9,305.952,-8.79,8.91
2009Q1,2009.0,1.0,12925.410,9209.2,1558.494,996.287,9926.4,212.671,1592.8,0.22,8.1,306.547,0.94,-0.71
2009Q2,2009.0,2.0,12901.504,9189.0,1456.678,1023.528,10077.5,214.469,1653.6,0.18,9.2,307.226,3.37,-3.19


## 重采样和频率转换
重采样及频率转换。在进行金融时间序列分析时，我们有时候需要进行数据频率的转换，比方在时，日，周，月，季度和年度数据之间进行转换。从高频率数据聚合产生低频率数据称为降采样(downsampling),而将低频了数据转换为高频率这称为升采样（upsampling）。在pandas中，用resample方法实现频率转换。


In [194]:
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = Series(randn(len(rng)), index=rng)
ts

2000-01-01   -0.028560
2000-01-02   -0.056850
2000-01-03   -0.695159
2000-01-04   -0.306225
2000-01-05   -0.328872
                ...   
2000-04-05    0.236040
2000-04-06   -0.672436
2000-04-07   -0.286239
2000-04-08   -1.376756
2000-04-09   -0.017042
Freq: D, Length: 100, dtype: float64

In [195]:
ts.resample?

[1;31mSignature:[0m
[0mts[0m[1;33m.[0m[0mresample[0m[1;33m([0m[1;33m
[0m    [0mrule[0m[1;33m,[0m[1;33m
[0m    [0maxis[0m[1;33m:[0m [1;34m'Axis'[0m [1;33m=[0m [1;36m0[0m[1;33m,[0m[1;33m
[0m    [0mclosed[0m[1;33m:[0m [1;34m'str | None'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mlabel[0m[1;33m:[0m [1;34m'str | None'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mconvention[0m[1;33m:[0m [1;34m'str'[0m [1;33m=[0m [1;34m'start'[0m[1;33m,[0m[1;33m
[0m    [0mkind[0m[1;33m:[0m [1;34m'str | None'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mloffset[0m[1;33m=[0m[1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mbase[0m[1;33m:[0m [1;34m'int | None'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mon[0m[1;33m:[0m [1;34m'Level'[0m [1;33m=[0m [1;32mNone[0m[1;33m,[0m[1;33m
[0m    [0mlevel[0m[1;33m:[0m [1;34m'Level'[0m [1;33m=[0m [1;32mNone[0m[1;3

In [193]:
ts.resample('M').mean()#按照月resample

2000-01-31   -0.227572
2000-02-29    0.313793
2000-03-31   -0.027252
2000-04-30    0.113460
Freq: M, dtype: float64

In [100]:
ts.resample('M',kind='period').sum()

2000-01   -4.091241
2000-02   -4.641962
2000-03    3.025323
2000-04    0.397917
Freq: M, dtype: float64

parameter| decription
------ |------
rule|  string the offset string or object representing target conversion
**how** | string method for down- or re-sampling, default to ‘mean’ for downsampling
axis | int, optional, default 0
fill_method  | string, default None fill_method for upsampling
**closed**  | {‘right’, ‘left’} Which side of bin interval is closed
**label**|  {‘right’, ‘left’} Which bin edge label to label bucket with
convention | {‘start’, ‘end’, ‘s’, ‘e’}
kind|  “period”/”timestamp”
loffset | timedelta Adjust the resampled time labels
limit| int, default None, Maximum size gap to when reindexing with fill_method
base |  int, default 0 For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0

### Downsampling

利用高频数据作低频的汇总

关键是要明确计算的区间，


In [196]:
rng = pd.date_range('1/1/2000', periods=12, freq='T')
ts = Series(np.arange(12), index=rng)
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

In [197]:
ts.resample('5min').sum()
# note: output changed (as the default changed from closed='right', label='right' to closed='left', label='left'

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

In [201]:
ts.resample('5min',closed='left').sum()
# 包含左 端点，但不包含右端点

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

In [199]:
ts.resample('5min', closed='left', label='right').sum()

2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
2000-01-01 00:15:00    21
Freq: 5T, dtype: int32

In [105]:
ts.resample('5min', loffset='-1s').sum()
#loffset=-1s,时间向前偏差1秒

1999-12-31 23:59:59    10
2000-01-01 00:04:59    35
2000-01-01 00:09:59    21
Freq: 5T, dtype: int32

#### Open-High-Low-Close (OHLC) resampling

金融中计算区间的开盘价、收盘价、最高、最低价。

In [202]:
ts.resample('5min').ohlc()
# note: output changed because of changed defaults

Unnamed: 0,open,high,low,close
2000-01-01 00:00:00,0,4,0,4
2000-01-01 00:05:00,5,9,5,9
2000-01-01 00:10:00,10,11,10,11


#### Resampling with GroupBy

In [None]:
ts.groupby?

In [206]:
rng = pd.date_range('1/1/2000', periods=1000, freq='D')
ts = Series(np.arange(1000), index=rng)
print(ts)
ts.groupby(lambda x: x.month).mean()#groupby需要给一个等长的向量
#groupby根据index元素的函数进行

2000-01-01      0
2000-01-02      1
2000-01-03      2
2000-01-04      3
2000-01-05      4
             ... 
2002-09-22    995
2002-09-23    996
2002-09-24    997
2002-09-25    998
2002-09-26    999
Freq: D, Length: 1000, dtype: int32


1     380.666667
2     406.035294
3     440.000000
4     470.500000
5     501.000000
6     531.500000
7     562.000000
8     593.000000
9     605.918605
10    471.500000
11    502.000000
12    532.500000
dtype: float64

In [210]:
ts.groupby(lambda x: x.weekday).mean()

0    499.0
1    500.0
2    501.0
3    502.0
4    499.5
5    497.0
6    498.0
dtype: float64

1. 计算所有月同一日对应的数值的和
2. 计算所有工作日（周一到周五）和休息日的均值（周六周日）

In [109]:
ts.groupby(lambda x: x.weekday() in [1,2,3,4,5]).mean()

False    498.50000
True     499.90056
dtype: float64

### Upsampling and interpolation******

升采样和插值，
升采样时，有的值就会缺失，因此可以能要插值。

In [211]:
frame = DataFrame(np.random.randn(2, 4),
                  index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,0.11483,1.40996,-0.626491,0.857395
2000-01-12,-0.187378,-0.805177,-1.532256,-1.223429


In [111]:
frame.resample('D').first()
 

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.200353,0.840303,-0.283287,-1.228777
2000-01-06,,,,
2000-01-07,,,,
2000-01-08,,,,
2000-01-09,,,,
2000-01-10,,,,
2000-01-11,,,,
2000-01-12,-0.649803,2.510537,-0.467724,-0.614696


In [112]:
frame.resample('D').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.200353,0.840303,-0.283287,-1.228777
2000-01-06,1.200353,0.840303,-0.283287,-1.228777
2000-01-07,1.200353,0.840303,-0.283287,-1.228777
2000-01-08,1.200353,0.840303,-0.283287,-1.228777
2000-01-09,1.200353,0.840303,-0.283287,-1.228777
2000-01-10,1.200353,0.840303,-0.283287,-1.228777
2000-01-11,1.200353,0.840303,-0.283287,-1.228777
2000-01-12,-0.649803,2.510537,-0.467724,-0.614696


In [113]:
frame.resample('D').ffill(limit=2)

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-05,1.200353,0.840303,-0.283287,-1.228777
2000-01-06,1.200353,0.840303,-0.283287,-1.228777
2000-01-07,1.200353,0.840303,-0.283287,-1.228777
2000-01-08,,,,
2000-01-09,,,,
2000-01-10,,,,
2000-01-11,,,,
2000-01-12,-0.649803,2.510537,-0.467724,-0.614696


In [114]:
frame.resample('W-THU').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01-06,1.200353,0.840303,-0.283287,-1.228777
2000-01-13,-0.649803,2.510537,-0.467724,-0.614696


### Resampling with periods

通过时期采样

In [115]:
frame = DataFrame(np.random.randn(24, 4),
                  index=pd.period_range('1-2000', '12-2001', freq='M'),
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]

Unnamed: 0,Colorado,Texas,New York,Ohio
2000-01,-1.111744,-0.696383,0.254506,-1.370185
2000-02,-0.205392,-0.433984,0.512341,1.361802
2000-03,-0.09447,0.127671,-0.616774,-3.172215
2000-04,0.649893,-0.761473,0.022421,0.779078
2000-05,-1.837773,0.26387,-0.688442,-0.766511


In [116]:
annual_frame = frame.resample('A-DEC').mean()
annual_frame

Unnamed: 0,Colorado,Texas,New York,Ohio
2000,-0.406554,0.170947,-0.313566,-0.286282
2001,0.205663,-0.332031,-0.09591,-0.279879


In [117]:
# Q-DEC: Quarterly, year ending in December
annual_frame.resample('Q-DEC').ffill()
# note: output changed, default value changed from convention='end' to convention='start' + 'start' changed to span-like
# also the following cells

Unnamed: 0,Colorado,Texas,New York,Ohio
2000Q1,-0.406554,0.170947,-0.313566,-0.286282
2000Q2,-0.406554,0.170947,-0.313566,-0.286282
2000Q3,-0.406554,0.170947,-0.313566,-0.286282
2000Q4,-0.406554,0.170947,-0.313566,-0.286282
2001Q1,0.205663,-0.332031,-0.09591,-0.279879
2001Q2,0.205663,-0.332031,-0.09591,-0.279879
2001Q3,0.205663,-0.332031,-0.09591,-0.279879
2001Q4,0.205663,-0.332031,-0.09591,-0.279879


In [118]:
annual_frame.resample('Q-MAR').ffill()

Unnamed: 0,Colorado,Texas,New York,Ohio
2000Q4,-0.406554,0.170947,-0.313566,-0.286282
2001Q1,-0.406554,0.170947,-0.313566,-0.286282
2001Q2,-0.406554,0.170947,-0.313566,-0.286282
2001Q3,-0.406554,0.170947,-0.313566,-0.286282
2001Q4,0.205663,-0.332031,-0.09591,-0.279879
2002Q1,0.205663,-0.332031,-0.09591,-0.279879
2002Q2,0.205663,-0.332031,-0.09591,-0.279879
2002Q3,0.205663,-0.332031,-0.09591,-0.279879


A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset aliases (referred to as time rules prior to v0.8.0).

Alias |	Description
------|------
B |	business day frequency
C|custom business day frequency (experimental)
D|	calendar day frequency
W|	weekly frequency
M|	month end frequency
BM	|business month end frequency
CBM|	custom business month end frequency
MS|	month start frequency
BMS|	business month start frequency
CBMS|	custom business month start frequency
Q|	quarter end frequency
BQ	|business quarter endfrequency
QS|	quarter start frequency
BQS|	business quarter start frequency
A|	year end frequency
BA	|business year end frequency
AS|	year start frequency
BAS|	business year start frequency
BH|	business hour frequency
H|	hourly frequency
T	|minutely frequency
S	|secondly frequency
L|	milliseonds
U	|microseconds
N|	nanoseconds

In [119]:
ts.groupby(lambda x: x.day).sum()
ts.groupby(lambda x: x.dayofweek<5).mean() 

False    497.50000
True     500.30112
dtype: float64

In [121]:
%pwd

'D:\\teaching\\金融数据分析datafin'

In [124]:
path='D:\\teaching\\金融数据分析datafin\\内盘合约日线数据\\'
xl = pd.ExcelFile(path+"大商所-淀粉.xlsx")
xl.sheet_names 

Da=pd.read_excel(path+"大商所-淀粉.xlsx",None,skiprows=[0,1],usecols=None ,index_col=0)
Da[xl.sheet_names[0]]["2015-10-1":"2015-10-10"]

Unnamed: 0_level_0,开盘价,最高价,最低价,收盘价,成交量
日期,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-10-08,2145,2145,2037,2041,168204
2015-10-09,2042,2090,2037,2065,170228


In [125]:
D1=pd.read_excel(path+"大商所-淀粉.xlsx",skiprows=[0,1],rows=10,sheet_name=0,usecols=None ,index_col=0)
D1["2015-1"]

Unnamed: 0_level_0,开盘价,最高价,最低价,收盘价,成交量
日期,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2015-01-19,2800,2800,2674,2707,12
2015-01-20,2743,2743,2743,2743,0
2015-01-21,2756,2756,2676,2711,18
2015-01-22,2720,2720,2720,2720,2
2015-01-23,2756,2756,2697,2697,4
2015-01-26,2726,2726,2726,2726,0
2015-01-27,2726,2726,2726,2726,0
2015-01-28,2720,2720,2720,2720,8
2015-01-29,2759,2759,2739,2750,26
2015-01-30,2745,2745,2745,2745,0
