# 11.2 Time Series Basics

In pandas, a basic time series object is a Series indexed by timestamps. Outside of pandas, those timestamps are usually represented as Python strings or datetime objects:

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8), 
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [3]:
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02   -1.035874
2011-01-05   -1.202860
2011-01-07    0.485130
2011-01-08    0.387001
2011-01-10    0.886668
2011-01-12   -0.308020
dtype: float64

Under the hood, the datetime objects have been put into a DatetimeIndex:

In [4]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
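
As a minimal sketch of this conversion, the snippet below builds a Series from a list of datetime objects and checks the type of the resulting index. The fixed values and the names `s` and `explicit` are illustrative assumptions, not part of the original example.

```python
from datetime import datetime

import pandas as pd

# Illustrative fixed values; the original example uses random data.
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7)]
s = pd.Series([1.0, 2.0, 3.0], index=dates)

# The list of datetime objects has been converted to a DatetimeIndex.
print(type(s.index).__name__)

# Building the index explicitly produces an equal index.
explicit = pd.DatetimeIndex(dates)
print(s.index.equals(explicit))
```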

As with any other Series, arithmetic between differently indexed time series automatically aligns the values on the dates:

In [5]:
ts[::2]

2011-01-02   -1.035874
2011-01-07    0.485130
2011-01-10    0.886668
dtype: float64

In [6]:
ts + ts[::2]

2011-01-02   -2.071747
2011-01-05         NaN
2011-01-07    0.970260
2011-01-08         NaN
2011-01-10    1.773336
2011-01-12         NaN
dtype: float64
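
The alignment can be seen more easily with fixed values. The sketch below uses hypothetical names (`full`, `every_other`) and small integers in place of the random data above. Dates present in only one operand produce NaN, and `Series.add` with `fill_value` is one way to treat the missing side as zero instead.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=4)  # four consecutive days
full = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
every_other = full[::2]  # keeps the 1st and 3rd day only

# Values are matched by date, not by position.
total = full + every_other
print(total)

# Treat dates missing from one operand as 0 instead of producing NaN.
filled = full.add(every_other, fill_value=0)
print(filled)
```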

pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:

In [7]:
ts.index.dtype

dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [9]:
stamp = ts.index[0]
stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object.
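
The reason this substitution works is that `pd.Timestamp` is a subclass of `datetime.datetime`. A short check, using a hypothetical `stamp` value built directly from a string:

```python
from datetime import datetime

import pandas as pd

stamp = pd.Timestamp("2011-01-02")

# Timestamp subclasses datetime.datetime, so isinstance checks pass.
print(isinstance(stamp, datetime))

# It compares equal to the corresponding datetime object.
print(stamp == datetime(2011, 1, 2))

# Conversion to a plain datetime is available when one is required.
plain = stamp.to_pydatetime()
print(type(plain).__name__, plain)
```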

# 1 Indexing, Selection, Subsetting

When you index and select data by label, a time series behaves like any other pandas.Series:

In [10]:
ts

2011-01-02   -1.035874
2011-01-05   -1.202860
2011-01-07    0.485130
2011-01-08    0.387001
2011-01-10    0.886668
2011-01-12   -0.308020
dtype: float64

In [11]:
stamp = ts.index[2]

In [12]:
ts[stamp]

0.4851297924921443

As a convenience, you can also pass a string that can be interpreted as a date:

In [13]:
ts['1/10/2011']

0.8866677578640286

In [14]:
ts['20110110']

0.8866677578640286
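
The following sketch, with illustrative integer values, shows that different string spellings of the same date resolve to the same element as an explicit `pd.Timestamp`. It uses `.loc` to make the label-based lookup explicit; note that a string such as `'1/10/2011'` is read month-first, which is an easy source of confusion, so the ISO form is often the safer choice.

```python
import pandas as pd

idx = pd.DatetimeIndex(["2011-01-02", "2011-01-05", "2011-01-10"])
s = pd.Series([10, 20, 30], index=idx)

a = s.loc["1/10/2011"]                  # month/day/year
b = s.loc["20110110"]                   # compact ISO-style form
c = s.loc["2011-01-10"]                 # ISO 8601 form
d = s.loc[pd.Timestamp("2011-01-10")]   # explicit Timestamp

print(a, b, c, d)
```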

For longer time series, you can pass just a year, or a year and month, to select the matching slice of data:

In [15]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01   -0.898782
2000-01-02   -0.760490
2000-01-03    1.727453
2000-01-04   -1.183669
2000-01-05    1.130868
2000-01-06   -2.793887
2000-01-07    0.359611
2000-01-08    0.138000
2000-01-09    0.230698
2000-01-10    1.185789
2000-01-11   -0.644326
2000-01-12    0.912081
2000-01-13   -1.738916
2000-01-14    1.581153
2000-01-15    0.546533
2000-01-16   -1.573190
2000-01-17   -1.690137
2000-01-18    1.456726
2000-01-19   -0.099057
2000-01-20    1.294297
2000-01-21   -0.987475
2000-01-22    0.220771
2000-01-23   -1.171818
2000-01-24   -0.189120
2000-01-25   -1.116764
2000-01-26    0.324010
2000-01-27   -0.680996
2000-01-28   -0.745145
2000-01-29    0.607720
2000-01-30    0.195630
                ...   
2002-08-28   -2.171114
2002-08-29    1.159835
2002-08-30   -0.106134
2002-08-31    0.222484
2002-09-01   -1.609198
2002-09-02   -0.903979
2002-09-03   -1.468260
2002-09-04    2.102030
2002-09-05   -1.571241
2002-09-06   -1.125820
2002-09-07    0.290032
2002-09-08   -0.162553
2002-09-09 

In [16]:
longer_ts['2001']

2001-01-01    0.594825
2001-01-02   -0.138438
2001-01-03    0.426880
2001-01-04    0.890645
2001-01-05   -0.635403
2001-01-06    0.608222
2001-01-07    0.970126
2001-01-08    0.764653
2001-01-09    0.326323
2001-01-10   -0.301089
2001-01-11   -0.504722
2001-01-12   -1.060614
2001-01-13    0.725708
2001-01-14    0.278906
2001-01-15    0.711877
2001-01-16    1.324654
2001-01-17    0.356257
2001-01-18    0.999546
2001-01-19   -0.833555
2001-01-20   -0.567741
2001-01-21   -1.270620
2001-01-22   -0.329581
2001-01-23   -1.962586
2001-01-24    0.222409
2001-01-25   -1.430861
2001-01-26   -0.846926
2001-01-27   -0.236396
2001-01-28    0.759687
2001-01-29   -2.051539
2001-01-30    0.437078
                ...   
2001-12-02   -0.956339
2001-12-03   -1.462125
2001-12-04   -0.708139
2001-12-05   -0.790011
2001-12-06   -0.484842
2001-12-07    1.155748
2001-12-08   -0.543679
2001-12-09   -0.034427
2001-12-10    1.136830
2001-12-11   -0.831510
2001-12-12    0.618170
2001-12-13    0.039535
2001-12-14 

In [17]:
longer_ts['2001-05']

2001-05-01   -1.260731
2001-05-02   -1.076953
2001-05-03   -1.273644
2001-05-04    0.186166
2001-05-05   -2.045040
2001-05-06    0.388047
2001-05-07   -0.169041
2001-05-08   -1.272060
2001-05-09    0.664821
2001-05-10    0.184446
2001-05-11   -0.686513
2001-05-12   -0.160608
2001-05-13   -1.433813
2001-05-14   -0.152434
2001-05-15    0.559079
2001-05-16   -1.202589
2001-05-17   -0.103353
2001-05-18    0.769911
2001-05-19   -1.768087
2001-05-20    0.369396
2001-05-21    0.835071
2001-05-22    0.854277
2001-05-23   -1.421996
2001-05-24    0.003957
2001-05-25   -0.419054
2001-05-26   -0.244487
2001-05-27    0.161019
2001-05-28   -0.139120
2001-05-29   -0.763757
2001-05-30    0.499741
2001-05-31    1.010171
Freq: D, dtype: float64
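
To make the effect of these partial-string selections easy to verify, the sketch below repeats them on a series whose values are simply 0 to 999 (an assumption for illustration) and checks how many rows each selection returns.

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=1000)  # daily frequency
s = pd.Series(range(1000), index=idx)

year_2001 = s.loc["2001"]    # every day in 2001
may_2001 = s.loc["2001-05"]  # every day in May 2001

print(len(year_2001), len(may_2001))
print(year_2001.index.min(), year_2001.index.max())
```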

Indexing and slicing with datetime objects works as well:

In [18]:
ts[datetime(2011, 1, 7)]

0.4851297924921443

In [19]:
ts

2011-01-02   -1.035874
2011-01-05   -1.202860
2011-01-07    0.485130
2011-01-08    0.387001
2011-01-10    0.886668
2011-01-12   -0.308020
dtype: float64

In [20]:
ts['1/6/2011':'1/11/2011']

2011-01-07    0.485130
2011-01-08    0.387001
2011-01-10    0.886668
dtype: float64

In [21]:
ts.truncate(after='1/9/2011')

2011-01-02   -1.035874
2011-01-05   -1.202860
2011-01-07    0.485130
2011-01-08    0.387001
dtype: float64
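
Two points in the cells above are worth spelling out. The slice endpoints do not need to exist in the index, so on a sorted index a slice behaves like a range query, and `truncate` cuts the series at a date. The sketch below reproduces both with the same dates and illustrative integer values.

```python
import pandas as pd

idx = pd.DatetimeIndex(["2011-01-02", "2011-01-05", "2011-01-07",
                        "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series([0, 1, 2, 3, 4, 5], index=idx)

# Neither endpoint is in the index; the slice selects the dates in between.
window = s.loc["2011-01-06":"2011-01-11"]
print(window.tolist())

# truncate(after=...) drops every observation later than the given date.
head = s.truncate(after="2011-01-09")
print(head.tolist())

# truncate(before=...) drops every observation earlier than the given date.
tail = s.truncate(before="2011-01-09")
print(tail.tolist())
```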

All of this holds true for DataFrame as well, where the indexing applies to the rows:

In [22]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [23]:
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas',
                                'New York', 'Ohio'])

In [24]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02  0.783675  0.107951  0.166675 -0.673468
2001-05-09 -0.366542  0.881290 -1.863978 -0.326421
2001-05-16  0.339688  0.260875 -1.562306  1.027188
2001-05-23  0.425704 -2.158926  0.423621 -0.171503
2001-05-30  1.667901 -0.152497 -0.344973 -0.826279
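
A deterministic variant of the DataFrame example is sketched below, with `np.arange` standing in for the random values (an assumption for illustration) and the month written in ISO form. It confirms that the month selection returns the five Wednesdays of May 2001 and that the result can be narrowed further by column.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=dates,
                  columns=["Colorado", "Texas", "New York", "Ohio"])

# Partial-string selection on the row index: all rows in May 2001.
may = df.loc["2001-05"]
print(may.index.day.tolist())

# Row and column selection can be combined in a single .loc call.
texas_may = df.loc["2001-05", "Texas"]
print(texas_may.shape)
```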


# 2 Time Series with Duplicate Indices

In some datasets, several observations may fall on the same timestamp:

In [25]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', 
                          '1/2/2000', '1/3/2000'])

In [26]:
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

We can tell whether the index is unique by checking its is_unique attribute:

In [27]:
dup_ts.index.is_unique

False

Indexing into this time series produces either a scalar value or a slice, depending on whether the timestamp is duplicated:

In [28]:
dup_ts['1/3/2000'] # not duplicated

4

In [29]:
dup_ts['1/2/2000'] # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32
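
Because the return type changes with the data, code that indexes such a series may need to handle both cases. The sketch below shows the two return types and, as one possible workaround rather than a prescribed method, a boolean mask that always yields a Series. `Index.duplicated` is also shown as a way to locate the repeated timestamps.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
s = pd.Series(range(5), index=dates)

unique_hit = s.loc["2000-01-03"]  # timestamp occurs once: scalar
dup_hit = s.loc["2000-01-02"]     # timestamp occurs three times: Series

print(type(dup_hit).__name__, dup_hit.tolist())

# A boolean mask returns a Series whether or not the key is duplicated.
always_series = s[s.index == pd.Timestamp("2000-01-03")]
print(always_series.tolist())

# Flag every row whose timestamp appears more than once.
is_dup = s.index.duplicated(keep=False)
print(is_dup.tolist())
```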

Suppose you want to aggregate the data that share a timestamp. One way to do this is to use groupby and pass level=0:

In [30]:
grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

In [31]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
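
The sketch below repeats the aggregation and checks that the result has a unique index. It also shows one alternative, which is an addition here and not part of the original example: when only one observation per timestamp is needed, the duplicates can be dropped with `Index.duplicated` instead of being aggregated. Note that recent pandas versions return the mean as floating-point values.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
s = pd.Series([0, 1, 2, 3, 4], index=dates)

# level=0 groups by the (only) index level, i.e. by timestamp.
grouped = s.groupby(level=0)
means = grouped.mean()
counts = grouped.count()
print(means.index.is_unique)
print(counts.tolist())

# Alternative: keep only the first observation for each timestamp.
first_only = s[~s.index.duplicated(keep="first")]
print(first_only.tolist())
```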