# 11.2 Time Series Basics

The most basic kind of time series in pandas is a Series indexed by timestamps, which are usually given as Python strings or datetime objects:

In [16]:
from datetime import datetime

import pandas as pd
import numpy as np



In [17]:
dates = [datetime(2011, 1, 2),  
         datetime(2011, 1, 5),
         datetime(2011, 1, 7),  
         datetime(2011, 1, 8),
         datetime(2011, 1, 10), 
         datetime(2011, 1, 12)
        ]


ts = pd.Series(np.random.randn(6), index=dates)

In [12]:
ts

2011-01-02   -0.408426
2011-01-05    1.598457
2011-01-07    0.383061
2011-01-08   -1.022420
2011-01-10   -1.760558
2011-01-12   -0.384750
dtype: float64

Under the hood, these datetime objects are stored in a DatetimeIndex:

In [9]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
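Since timestamps often arrive as strings rather than datetime objects, here is a minimal sketch of building the same kind of index from date strings with `pd.to_datetime`. The sample values and the name `ts_from_strings` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: three date strings paired with arbitrary values.
date_strings = ["2011-01-02", "2011-01-05", "2011-01-07"]

# pd.to_datetime parses the strings into a DatetimeIndex.
ts_from_strings = pd.Series([1.0, 2.0, 3.0], index=pd.to_datetime(date_strings))

print(type(ts_from_strings.index).__name__)
print(ts_from_strings.index[0])
```

The resulting index behaves the same way as one built from datetime objects.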

As with any other Series, arithmetic between differently indexed time series automatically aligns on the dates:

In [10]:
ts[::2]

2011-01-02   -0.408426
2011-01-07    0.383061
2011-01-10   -1.760558
dtype: float64

In [11]:
ts + ts[::2]

2011-01-02   -0.816852
2011-01-05         NaN
2011-01-07    0.766122
2011-01-08         NaN
2011-01-10   -3.521116
2011-01-12         NaN
dtype: float64
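The NaN entries above come from dates that appear in only one operand. As a small sketch with made-up, deterministic values (the name `ts_demo` is illustrative), the `add` method with `fill_value` treats a missing operand as a given number instead:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
# Deterministic values so the alignment is easy to follow.
ts_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

every_other = ts_demo.iloc[::2]

# The + operator aligns on dates; dates missing from one side give NaN.
aligned = ts_demo + every_other

# add() with fill_value=0 treats the missing side as 0 instead.
filled = ts_demo.add(every_other, fill_value=0)

print(aligned)
print(filled)
```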

pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution:

In [13]:
ts.index.dtype

dtype('<M8[ns]')
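To make the nanosecond representation concrete, the following sketch reads the integer behind a single Timestamp through its `value` attribute, which is the number of nanoseconds since the Unix epoch. The variable names are illustrative.

```python
from datetime import date

import pandas as pd

stamp = pd.Timestamp("2011-01-02")

# Nanoseconds in one day.
ns_per_day = 24 * 60 * 60 * 10**9

# Timestamp.value is the integer count of nanoseconds since 1970-01-01.
days_since_epoch = stamp.value // ns_per_day
remainder_ns = stamp.value % ns_per_day

print(stamp.value)
print(days_since_epoch, remainder_ns)
```

A midnight timestamp leaves no remainder, and the day count matches plain date arithmetic.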

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [21]:
stamp = ts.index[0]

stamp

Timestamp('2011-01-02 00:00:00')

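A Timestamp can be converted to a datetime object whenever needed. It also knows how to perform time zone conversions and other operations, and in older pandas releases it can carry frequency information (if any). These topics are covered in detail later.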
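A short sketch of that interchangeability: Timestamp is a subclass of the standard library's datetime, and `to_pydatetime` returns a plain datetime when one is required.

```python
from datetime import datetime

import pandas as pd

stamp = pd.Timestamp("2011-01-02")

# Timestamp subclasses datetime, so it can be used where a datetime is expected.
is_datetime = isinstance(stamp, datetime)

# to_pydatetime() returns a plain standard-library datetime.
plain = stamp.to_pydatetime()

print(is_datetime)
print(type(plain).__name__, plain)
```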

## Indexing, Selection, Subsetting

A time series behaves like any other pandas.Series when you select data by label:

In [28]:
stamp = ts.index[2]

stamp

Timestamp('2011-01-07 00:00:00')

In [30]:
ts

2011-01-02   -1.483063
2011-01-05    0.379312
2011-01-07   -0.471273
2011-01-08   -0.051339
2011-01-10   -1.132343
2011-01-12    0.473631
dtype: float64

In [29]:
ts[stamp]

-0.47127311798706856

As a convenience, you can also pass a string that can be interpreted as a date:

In [24]:
ts['1/10/2011']

-1.1323429221231307

In [25]:
ts['20110110']

-1.1323429221231307
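The sketch below, using made-up integer values so the result is easy to check, confirms that both string spellings and a datetime object resolve to the same element. The name `ts_demo` is illustrative, and `.loc` is used to make the label-based lookup explicit.

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series([0, 1, 2, 3, 4, 5], index=dates)

# Three spellings of the same date: January 10, 2011.
by_slash_string = ts_demo.loc["1/10/2011"]
by_compact_string = ts_demo.loc["20110110"]
by_datetime = ts_demo.loc[datetime(2011, 1, 10)]

print(by_slash_string, by_compact_string, by_datetime)
```

Note that a string such as '1/10/2011' is read month first, so it means January 10 rather than October 1.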

For longer time series, you can pass just a year, or a year and month, to select a slice of the data:

In [31]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))

longer_ts

2000-01-01   -0.096114
2000-01-02    1.169649
2000-01-03    0.093056
2000-01-04    0.103898
2000-01-05   -0.736592
                ...   
2002-09-22   -1.030183
2002-09-23   -0.656297
2002-09-24   -1.062430
2002-09-25   -0.994502
2002-09-26    1.553867
Freq: D, Length: 1000, dtype: float64

In [32]:
longer_ts['2001']

2001-01-01   -1.809703
2001-01-02    0.676659
2001-01-03    0.063402
2001-01-04   -0.644760
2001-01-05   -0.396866
                ...   
2001-12-27   -1.767160
2001-12-28    1.638261
2001-12-29    1.523844
2001-12-30    0.126169
2001-12-31   -0.443061
Freq: D, Length: 365, dtype: float64

Here, the string '2001' is interpreted as a year, and the matching time period is selected. Specifying a month works the same way:

In [33]:
longer_ts['2001-05']

2001-05-01    0.145742
2001-05-02   -0.506860
2001-05-03   -0.345550
2001-05-04    1.024504
2001-05-05    0.585017
2001-05-06   -0.004322
2001-05-07   -1.253289
2001-05-08   -0.086394
2001-05-09    1.326425
2001-05-10    0.193412
2001-05-11    0.160195
2001-05-12    3.020535
2001-05-13    0.595003
2001-05-14   -1.494349
2001-05-15   -0.149323
2001-05-16    0.585441
2001-05-17    0.510000
2001-05-18   -1.176993
2001-05-19    0.836444
2001-05-20   -0.978820
2001-05-21   -0.497039
2001-05-22    0.387140
2001-05-23   -1.259972
2001-05-24    0.230182
2001-05-25   -0.047785
2001-05-26   -0.942228
2001-05-27    1.335167
2001-05-28    0.043434
2001-05-29    1.112417
2001-05-30   -0.143607
2001-05-31   -0.220669
Freq: D, dtype: float64
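As a quick check of how partial strings map to periods, the following sketch counts the rows each selection returns on a daily series of the same shape. The integer values and the name `longer_demo` are illustrative.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 2000-01-01, with simple integer values.
longer_demo = pd.Series(np.arange(1000),
                        index=pd.date_range("2000-01-01", periods=1000))

year_2000 = longer_demo.loc["2000"]    # leap year
year_2001 = longer_demo.loc["2001"]
may_2001 = longer_demo.loc["2001-05"]

print(len(year_2000), len(year_2001), len(may_2001))
print(may_2001.index[0], may_2001.index[-1])
```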

Slicing with datetime objects works as well:

In [34]:
ts[datetime(2011, 1, 7):]

2011-01-07   -0.471273
2011-01-08   -0.051339
2011-01-10   -1.132343
2011-01-12    0.473631
dtype: float64

Because most time series data is ordered chronologically, you can also slice with timestamps that are not in the time series, which amounts to a range query:

In [35]:
ts

2011-01-02   -1.483063
2011-01-05    0.379312
2011-01-07   -0.471273
2011-01-08   -0.051339
2011-01-10   -1.132343
2011-01-12    0.473631
dtype: float64

In [36]:
ts['1/6/2011':'1/11/2011']

2011-01-07   -0.471273
2011-01-08   -0.051339
2011-01-10   -1.132343
dtype: float64
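One detail worth seeing in isolation is that label-based slices include both endpoints. In this sketch with made-up integer values (`ts_demo` is an illustrative name), bounds that fall between observations and bounds that hit observations exactly return the same rows:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series([0, 1, 2, 3, 4, 5], index=dates)

# Neither bound exists in the index; rows inside the range are returned.
between = ts_demo.loc["2011-01-06":"2011-01-11"]

# Both bounds exist in the index; both endpoints are included.
inclusive = ts_demo.loc["2011-01-07":"2011-01-10"]

print(between)
print(inclusive)
```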

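As before, you can pass a string date, a datetime, or a Timestamp. Note that slicing in this manner produces a view on the original time series, just like slicing a NumPy array.

This means that no data is copied. In the pandas version used here, modifications to the slice are reflected in the original data; with pandas' Copy-on-Write mode enabled, a modification to the slice no longer propagates to the original.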
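Whether a write through a slice reaches the original depends on the pandas version and its Copy-on-Write setting, so code that needs independent data is safer taking an explicit copy. A minimal sketch, with illustrative names and made-up values:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], index=dates)

# .copy() guarantees the window owns its data, regardless of pandas version.
window_copy = ts_demo.loc["2011-01-06":"2011-01-11"].copy()
window_copy.iloc[0] = 99.0

print(window_copy.iloc[0])           # the copy was modified
print(ts_demo.loc["2011-01-07"])     # the original is unchanged
```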

There is also an equivalent instance method, truncate, that slices a Series between two dates:

In [37]:
ts.truncate(after='1/9/2011')

2011-01-02   -1.483063
2011-01-05    0.379312
2011-01-07   -0.471273
2011-01-08   -0.051339
dtype: float64
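The example above passes only `after`. The sketch below, with made-up values and illustrative names, passes both `before` and `after` and compares the result with the corresponding label slice:

```python
from datetime import datetime

import pandas as pd

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts_demo = pd.Series([0, 1, 2, 3, 4, 5], index=dates)

# Keep only observations between the two dates (both bounds inclusive).
truncated = ts_demo.truncate(before="1/6/2011", after="1/11/2011")

# The same selection written as a label slice.
sliced = ts_demo.loc["1/6/2011":"1/11/2011"]

print(truncated)
print(truncated.equals(sliced))
```

Note that truncate expects a sorted index.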

All of the above also works for DataFrame. For example, indexing on the rows of a DataFrame:

In [39]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

dates

DatetimeIndex(['2000-01-05', '2000-01-12', '2000-01-19', '2000-01-26',
               '2000-02-02', '2000-02-09', '2000-02-16', '2000-02-23',
               '2000-03-01', '2000-03-08', '2000-03-15', '2000-03-22',
               '2000-03-29', '2000-04-05', '2000-04-12', '2000-04-19',
               '2000-04-26', '2000-05-03', '2000-05-10', '2000-05-17',
               '2000-05-24', '2000-05-31', '2000-06-07', '2000-06-14',
               '2000-06-21', '2000-06-28', '2000-07-05', '2000-07-12',
               '2000-07-19', '2000-07-26', '2000-08-02', '2000-08-09',
               '2000-08-16', '2000-08-23', '2000-08-30', '2000-09-06',
               '2000-09-13', '2000-09-20', '2000-09-27', '2000-10-04',
               '2000-10-11', '2000-10-18', '2000-10-25', '2000-11-01',
               '2000-11-08', '2000-11-15', '2000-11-22', '2000-11-29',
               '2000-12-06', '2000-12-13', '2000-12-20', '2000-12-27',
               '2001-01-03', '2001-01-10', '2001-01-17', '2001-01-24',
      

In [41]:
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas','New York', 'Ohio'])

long_df

            Colorado     Texas  New York      Ohio
2000-01-05  0.065952 -0.191071  0.494759 -0.219107
2000-01-12  2.472006 -0.691622  1.228253  0.509034
2000-01-19  0.677090  0.681932 -0.962329 -1.313845
2000-01-26  1.149651  0.338954  0.911078  0.447360
2000-02-02  0.056815 -1.238735 -1.091003 -0.386503
...              ...       ...       ...       ...
2001-10-31 -1.256055 -0.520620 -3.559648 -1.339546
2001-11-07  2.256898  2.960218 -2.292915 -0.380202
2001-11-14  0.097468 -0.665734 -0.501976 -0.980440
2001-11-21 -1.798154  0.245858  1.603977  2.225696


In [42]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02  0.413034 -0.759442  0.772159  1.029921
2001-05-09 -0.389429 -0.162595  0.800556 -0.760673
2001-05-16 -1.921415  0.555918 -1.018637  1.658074
2001-05-23 -0.447243 -0.743070 -0.211915 -1.252747
2001-05-30  0.684598 -0.165927  0.441707 -0.535392
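Row selection by date can be combined with column selection in a single `.loc` call. A small sketch on a frame of the same shape, filled with made-up integers (`long_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

# 100 consecutive Wednesdays, starting with the first one on or after 2000-01-01.
dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")
long_demo = pd.DataFrame(np.arange(400).reshape(100, 4),
                         index=dates,
                         columns=["Colorado", "Texas", "New York", "Ohio"])

# All rows in May 2001, every column.
may_2001 = long_demo.loc["2001-05"]

# The same rows, restricted to one column; the result is a Series.
texas_may_2001 = long_demo.loc["2001-05", "Texas"]

print(may_2001.shape)
print(texas_may_2001)
```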


## Time Series with Duplicate Indices

In some applications, multiple observations may fall on the same timestamp. Here is an example:

In [44]:
dates = pd.DatetimeIndex(['1/1/2000', 
                          '1/2/2000', 
                          '1/2/2000',
                          '1/2/2000', 
                          '1/3/2000'])

dup_ts = pd.Series(np.arange(5), index=dates)

dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

We can tell whether the index is unique by checking its is_unique property:

In [45]:
dup_ts.index.is_unique

False
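`is_unique` only says that duplicates exist. To see which timestamps repeat, one option is the index's `duplicated` method, sketched here on the same data under the illustrative name `dup_demo`:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_demo = pd.Series(np.arange(5), index=dates)

# keep=False marks every occurrence of a repeated timestamp as True.
mask = dup_demo.index.duplicated(keep=False)

# Rows whose timestamp appears more than once.
repeated = dup_demo[mask]

print(mask)
print(repeated)
```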

Indexing into this time series produces either a scalar value or a slice, depending on whether the selected timestamp is duplicated:

In [46]:
dup_ts['1/3/2000']  # not duplicated

4

In [47]:
dup_ts['1/2/2000']  # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64

Suppose you want to aggregate the data that has non-unique timestamps. One way is to use groupby and pass level=0:

In [49]:
grouped = dup_ts.groupby(level=0)

grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc11a3caee0>

In [50]:
grouped.mean()

2000-01-01    0.0
2000-01-02    2.0
2000-01-03    4.0
dtype: float64

In [51]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
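Two variations on this idea are sketched below with the same data under the illustrative name `dup_demo`: computing several aggregates in one pass with `agg`, and, when only one observation per timestamp is wanted, dropping the repeats instead of aggregating.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_demo = pd.Series(np.arange(5), index=dates)

# Several aggregates at once; the result is a DataFrame with one column each.
summary = dup_demo.groupby(level=0).agg(["mean", "count"])

# Alternative: keep only the first observation for each timestamp.
first_only = dup_demo[~dup_demo.index.duplicated(keep="first")]

print(summary)
print(first_only)
```

Which approach fits depends on whether the repeated observations carry information worth combining or are simply redundant.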