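### Glossary

| Type | Description |
|-|-|
| date | Stores a calendar date (year, month, day) in the Gregorian calendar |
| time | Stores a time of day as hours, minutes, seconds, and microseconds |
| datetime | Stores both the date and the time |
| timedelta | The difference between two datetime values (as days, seconds, and microseconds) |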
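As a quick illustration of the four types in the table, the sketch below builds one of each with the standard library. The variable names and the sample values are illustrative only.

```python
from datetime import date, time, datetime, timedelta

d = date(2022, 5, 20)                     # calendar date only
t = time(16, 47, 25, 253281)              # hour, minute, second, microsecond
dt = datetime.combine(d, t)               # date and time together
gap = dt - datetime(2022, 5, 19, 16, 0)   # subtracting datetimes gives a timedelta

print(d, t, dt, gap, sep="\n")
print(t.microsecond)                      # the sub-second field is in microseconds
```

Note that the smallest unit in all of these standard-library types is the microsecond, not the millisecond.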

## Parsing strings into times and converting between strings and dates

### datetime
Stores both the date and the time, down to the microsecond

In [27]:
from datetime import datetime
import pandas as pd

In [4]:
now = datetime.now()
now

datetime.datetime(2022, 5, 20, 16, 47, 25, 253281)

In [5]:
now.year, now.month, now.day

(2022, 5, 20)

In [7]:
# A timedelta represents the time difference between two datetime objects
delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
delta

datetime.timedelta(days=926, seconds=56700)

In [8]:
delta.days, delta.seconds

(926, 56700)

In [14]:
# Adding (or subtracting) a timedelta, or a multiple of one, to a datetime produces a new object
from datetime import timedelta
start = datetime(2011, 1, 7)
start + timedelta(12)

datetime.datetime(2011, 1, 19, 0, 0)

In [15]:
start - 2 * timedelta(12)

datetime.datetime(2010, 12, 14, 0, 0)
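One detail that is easy to miss in the timedelta example above: `delta.seconds` holds only the part of the difference that is left over after whole days are removed. A minimal sketch, reusing the same two dates, of how `total_seconds()` differs:

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# .seconds is only the leftover within the last day (always 0..86399)
leftover = delta.seconds
# .total_seconds() converts the whole span, days included
total = delta.total_seconds()

print(leftover, total)
```

When a duration is needed as a single number, `total_seconds()` is usually the safer choice.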

### Converting between strings and datetime

In [18]:
# datetime, Timestamp -> string
stamp = datetime(2011, 1, 3)
stamp

datetime.datetime(2011, 1, 3, 0, 0)

In [19]:
str(stamp)

'2011-01-03 00:00:00'

In [20]:
stamp.strftime('%Y-%m-%d') # convert using an explicit format string

'2011-01-03'

#### Manual parsing

In [21]:
# string -> datetime
value = '2011-01-03'
datetime.strptime(value, '%Y-%m-%d') # parse a date using a known format

datetime.datetime(2011, 1, 3, 0, 0)

In [23]:
datestrs = ['7/6/2011', '8/6/2011']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
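Because `strptime` needs the exact format, a string in any other layout raises `ValueError`. The sketch below wraps that in a small helper; `try_parse` is a hypothetical name introduced here for illustration, not part of the standard library.

```python
from datetime import datetime

def try_parse(text, fmt):
    """Return a datetime, or None when text does not match fmt (hypothetical helper)."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # strptime raises ValueError on any format mismatch
        return None

ok = try_parse('7/6/2011', '%m/%d/%Y')
bad = try_parse('2011-07-06', '%m/%d/%Y')
print(ok, bad)
```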

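#### Parsing with dateutil

Writing a format specification every time is tedious. The parser.parse method of the third-party dateutil package (installed automatically along with pandas) recognizes the common date formats without one.
+ dateutil can parse almost any date representation a human can understand
+ dateutil.parser is a useful but imperfect tool.
  + For example, it recognizes some strings as dates that were never meant to be dates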

In [24]:
from dateutil.parser import parse
parse('2011-01-03')

datetime.datetime(2011, 1, 3, 0, 0)

In [25]:
parse('Jan 31, 1997 10:45 PM') # by default the month is assumed to come before the day

datetime.datetime(1997, 1, 31, 22, 45)

In [26]:
parse('6/12/2011', dayfirst=True) # dayfirst=True says the day comes before the month

datetime.datetime(2011, 12, 6, 0, 0)
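To make the effect of `dayfirst` concrete, this small sketch parses the same ambiguous string both ways. It assumes dateutil is installed, which is the case whenever pandas is.

```python
from datetime import datetime
from dateutil.parser import parse

us_style = parse('6/12/2011')                 # month first: 12 June 2011
eu_style = parse('6/12/2011', dayfirst=True)  # day first: 6 December 2011

print(us_style, eu_style)
```

Whenever the input comes from a locale that writes the day first, passing `dayfirst=True` explicitly avoids silently swapped dates.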

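#### Parsing date strings in several formats at once

The to_datetime method can parse many different date representations
+ Standard date formats such as ISO 8601 are parsed very quickly: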

In [29]:
datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

In [31]:
# Missing values are handled automatically
# NaT (Not a Time) is the pandas null value for timestamp data
idx = pd.to_datetime(datestrs + [None])
idx

DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [32]:
pd.isnull(idx)

array([False, False,  True])
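Besides `None`, real data often contains strings that are not dates at all. A minimal sketch, assuming the goal is to turn such entries into NaT rather than raise an error, uses the `errors='coerce'` option of `pd.to_datetime`; the sample list is made up for illustration.

```python
import pandas as pd

raw = ['2011-07-06 12:00:00', 'not a date', None]

# errors='coerce' converts anything unparseable into NaT instead of raising
idx = pd.to_datetime(raw, errors='coerce')

print(idx)
print(idx.isna())
```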

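## Using datetime in pandas
The most basic kind of time series in pandas is a Series indexed by timestamps

+ The datetime objects used as the index are stored in a DatetimeIndex, and each one becomes a pandas Timestamp scalar; a Timestamp can be converted back to a datetime object whenever needed
+ pandas stores timestamps with NumPy's datetime64 data type at nanosecond resolution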

In [35]:
from datetime import datetime
import numpy as np
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts

2011-01-02    0.153997
2011-01-05   -1.183854
2011-01-07   -0.996456
2011-01-08    1.847443
2011-01-10    2.252554
2011-01-12    0.080305
dtype: float64

In [36]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)

In [37]:
ts.index.dtype

dtype('<M8[ns]')

In [38]:
ts.index[0]

Timestamp('2011-01-02 00:00:00')
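The relationship between Timestamp and datetime can be checked directly. The sketch below is illustrative only (the variable names are not from the notes): it shows that an index element is a Timestamp, that Timestamp is a datetime subclass, and that `to_pydatetime()` returns a plain datetime.

```python
from datetime import datetime
import pandas as pd

ts_index = pd.DatetimeIndex([datetime(2011, 1, 2), datetime(2011, 1, 5)])

first = ts_index[0]             # elements come back as pandas Timestamp
as_py = first.to_pydatetime()   # explicit conversion to a plain datetime

print(type(first).__name__, type(as_py).__name__)
print(isinstance(first, datetime))   # Timestamp subclasses datetime
```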

### Indexing, slicing, and truncating

+ Slicing with date strings, datetime objects, or Timestamps produces a view on the original time series
+ By contrast, truncating and re-creating a Series through indexing and

In [39]:
# Option 1: index by label
stamp = ts.index[2]
ts[stamp]

-0.9964556899554602

In [43]:
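# Option 2: index with a string that can be interpreted as a date
# For longer time series, passing just a year, or a year and month, easily selects a slice of the data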
print(ts['1/10/2011'])
print(ts['20110110'])

2.252554490234618
2.252554490234618


In [42]:
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2000', periods=1000))
longer_ts

2000-01-01    1.321602
2000-01-02    0.869775
2000-01-03   -0.216551
2000-01-04    0.486702
2000-01-05   -0.864972
                ...   
2002-09-22   -0.048540
2002-09-23   -1.029282
2002-09-24    2.585740
2002-09-25   -0.135398
2002-09-26   -0.324830
Freq: D, Length: 1000, dtype: float64

In [44]:
longer_ts['2001'] # interpreted as a year

2001-01-01   -0.043641
2001-01-02    0.630063
2001-01-03    1.252487
2001-01-04    1.002574
2001-01-05   -0.989158
                ...   
2001-12-27    1.892158
2001-12-28    1.808449
2001-12-29   -0.129285
2001-12-30    0.338748
2001-12-31   -1.548591
Freq: D, Length: 365, dtype: float64

In [46]:
longer_ts['2001-12'] # interpreted as year-month

2001-12-01   -0.782002
2001-12-02   -1.358875
2001-12-03   -1.401515
2001-12-04    0.600647
2001-12-05   -0.799351
2001-12-06    0.249229
2001-12-07   -2.000473
2001-12-08    0.002088
2001-12-09    0.431282
2001-12-10    1.016692
2001-12-11    0.449250
2001-12-12    1.184704
2001-12-13    0.476502
2001-12-14   -1.726572
2001-12-15   -0.258082
2001-12-16   -0.181253
2001-12-17   -0.048411
2001-12-18    1.231630
2001-12-19   -0.215196
2001-12-20   -0.204083
2001-12-21   -0.690289
2001-12-22    0.404661
2001-12-23    0.195217
2001-12-24    0.137903
2001-12-25    0.727022
2001-12-26   -0.707786
2001-12-27    1.892158
2001-12-28    1.808449
2001-12-29   -0.129285
2001-12-30    0.338748
2001-12-31   -1.548591
Freq: D, dtype: float64

In [49]:
# Option 3: slice with datetime objects
print(ts[datetime(2011, 1, 7):])
print(ts[datetime(2011, 1, 7):datetime(2011, 1, 10)])

2011-01-07   -0.996456
2011-01-08    1.847443
2011-01-10    2.252554
2011-01-12    0.080305
dtype: float64
2011-01-07   -0.996456
2011-01-08    1.847443
2011-01-10    2.252554
dtype: float64


In [None]:
# Both slicing styles above also accept timestamps that are not in the time series (a range query)
# This works because most time series data is sorted chronologically
print(ts['1/6/2011':'1/11/2011'])

In [50]:
# truncate cuts a time series between two dates; here everything after 2011-01-09 is dropped
ts.truncate(after='1/9/2011')

2011-01-02    0.153997
2011-01-05   -1.183854
2011-01-07   -0.996456
2011-01-08    1.847443
dtype: float64
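As a rough cross-check of the range query and `truncate` shown above, the sketch below applies both to the same small series and compares the results. It uses fixed integer values instead of random numbers so the outcome is reproducible; the name `demo` is illustrative.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
demo = pd.Series(range(6), index=dates)

# Neither endpoint exists in the index; both ends of a label slice are inclusive
sliced = demo.loc['2011-01-06':'2011-01-11']
# truncate drops everything before `before` and after `after`
truncated = demo.truncate(before='2011-01-06', after='2011-01-11')

print(sliced)
print(sliced.equals(truncated))
```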

In [53]:
# The same operations work on a DataFrame
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED') # build a time index from a range
long_df = pd.DataFrame(np.random.randn(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])
long_df.loc['5-2001']

| |Colorado|Texas|New York|Ohio|
|-|-|-|-|-|
|2001-05-02|-0.501903|1.400425|0.676597|-1.142004|
|2001-05-09|0.378599|-0.689588|0.999881|1.408075|
|2001-05-16|-0.743081|0.210696|2.367079|0.302545|
|2001-05-23|-0.328865|0.022784|0.217754|1.403359|
|2001-05-30|0.346795|-0.141705|-0.502711|-0.378505|


### Duplicate indices

In [56]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

In [57]:
# Check whether the index is unique
dup_ts.index.is_unique

False

In [58]:
# Indexing now yields either a scalar or a slice, depending on whether the timestamp is duplicated
print(dup_ts['1/3/2000'])
print(dup_ts['1/2/2000'])

4
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64


In [62]:
# To aggregate data with non-unique timestamps, use groupby and pass level=0
grouped = dup_ts.groupby(level=0)
print(grouped.mean())
print(grouped.count())

2000-01-01    0.0
2000-01-02    2.0
2000-01-03    4.0
dtype: float64
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
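Aggregating is one way to deal with duplicate timestamps. Another option, sketched here as an alternative and not taken from the notes above, is to keep just one row per timestamp using `Index.duplicated`. The series is rebuilt under the name `dup_demo` so the example is self-contained.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_demo = pd.Series(np.arange(5), index=dates)

# duplicated() marks every repeat after the one that is kept
first_only = dup_demo[~dup_demo.index.duplicated(keep='first')]
last_only = dup_demo[~dup_demo.index.duplicated(keep='last')]

print(first_only)
print(last_only)
```

Which occurrence to keep depends on the data; `keep='last'` suits feeds where later records correct earlier ones.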


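### Resampling, frequency inference, and ranges
frequency = base frequency * multiplier

Base frequencies are represented in pandas by date offsets

More complex frequencies are called anchored offsets; their defining feature is that the time points are not evenly spaced

+ Users can define custom frequency classes to provide date logic that pandas lacks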
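The "base frequency times multiplier" idea can be inspected with `to_offset`, which turns a frequency string into an offset object. This is a small sketch; the frequency strings are chosen only for illustration.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

off = to_offset('90min')       # base frequency "minute", multiplier 90
anchored = to_offset('W-WED')  # weekly frequency anchored on Wednesday

print(off.n)              # the multiplier
print(anchored.weekday)   # Monday is 0, so Wednesday is 2

# An offset object can be passed anywhere a frequency string is accepted
rng = pd.date_range('2000-01-01', periods=3, freq=off)
print(rng)
```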

In [63]:
ts

2011-01-02    0.153997
2011-01-05   -1.183854
2011-01-07   -0.996456
2011-01-08    1.847443
2011-01-10    2.252554
2011-01-12    0.080305
dtype: float64

In [67]:
# Resample to convert to a fixed-frequency (daily) time series
resampler = ts.resample('D')
resampler

<pandas.core.resample.DatetimeIndexResampler object at 0x7f98ea367d00>
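The resampler object by itself holds no data until an aggregation or fill method is called on it. The following sketch uses fixed values (not the random series above) to show two common follow-ups: `mean()` leaves the newly created days empty, while `ffill()` carries the last known value forward.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

daily = demo.resample('D').mean()    # days without data become NaN
filled = demo.resample('D').ffill()  # days without data reuse the previous value

print(daily)
print(filled)
```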

In [72]:
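# Date range: generate a DatetimeIndex of the given length at the given frequency
# freq: daily by default; here start, end and periods are all given, so the 20 points are evenly spaced instead
# periods: number of data points to generate
# By default the time information in the start and end timestamps is preserved
index = pd.date_range('2012-04-01', '2012-06-01', periods=20)
len(index)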

20

In [74]:
pd.date_range(start='2012-04-01', periods=20, freq='BM')

DatetimeIndex(['2012-04-30', '2012-05-31', '2012-06-29', '2012-07-31',
               '2012-08-31', '2012-09-28', '2012-10-31', '2012-11-30',
               '2012-12-31', '2013-01-31', '2013-02-28', '2013-03-29',
               '2013-04-30', '2013-05-31', '2013-06-28', '2013-07-31',
               '2013-08-30', '2013-09-30', '2013-10-31', '2013-11-29'],
              dtype='datetime64[ns]', freq='BM')

In [75]:
# normalize=True normalizes the timestamps to midnight
pd.date_range(start='2012-04-01 12:56:31', periods=20, freq='BM', normalize=True)

DatetimeIndex(['2012-04-30', '2012-05-31', '2012-06-29', '2012-07-31',
               '2012-08-31', '2012-09-28', '2012-10-31', '2012-11-30',
               '2012-12-31', '2013-01-31', '2013-02-28', '2013-03-29',
               '2013-04-30', '2013-05-31', '2013-06-28', '2013-07-31',
               '2013-08-30', '2013-09-30', '2013-10-31', '2013-11-29'],
              dtype='datetime64[ns]', freq='BM')
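With a month-end frequency the effect of `normalize` depends on the pandas version and is hard to see in the output, so here is a minimal daily-frequency sketch where the difference is visible. The start timestamp is arbitrary.

```python
import pandas as pd

# Without normalize, the time of day from the start timestamp is kept
raw = pd.date_range(start='2012-05-02 12:56:31', periods=3)
# With normalize=True, every timestamp is reset to midnight
clean = pd.date_range(start='2012-05-02 12:56:31', periods=3, normalize=True)

print(raw)
print(clean)
```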

#### Date offsets

In [76]:
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
4 * hour # equivalent to Hour(4)

<4 * Hours>

In [77]:
# Different offset objects can be combined by addition
Hour(2) + Minute(20)

<140 * Minutes>

In [78]:
pd.date_range('2000-01-01', periods=10, freq='1h30min') # frequency string (a combination of offsets)

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')

In [83]:
# WOM (week of month): here, the third Friday of each month
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
rng

DatetimeIndex(['2012-01-20', '2012-02-17', '2012-03-16', '2012-04-20',
               '2012-05-18', '2012-06-15', '2012-07-20', '2012-08-17'],
              dtype='datetime64[ns]', freq='WOM-3FRI')

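#### Shifting time series

Shifting moves the data forward or backward along the time axis

+ The index stays unchanged, so some data is lost

Passing a frequency shifts the timestamps instead of the data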

In [85]:
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts

2000-01-31    0.139218
2000-02-29    0.699511
2000-03-31    0.896206
2000-04-30    0.878092
Freq: M, dtype: float64

In [86]:
ts.shift(2) # move the data 2 periods later

2000-01-31         NaN
2000-02-29         NaN
2000-03-31    0.139218
2000-04-30    0.699511
Freq: M, dtype: float64

In [87]:
ts.shift(-2) # move the data 2 periods earlier

2000-01-31    0.896206
2000-02-29    0.878092
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64

In [88]:
# Pass a frequency to shift the timestamps instead
ts.shift(2, freq='M')

2000-03-31    0.139218
2000-04-30    0.699511
2000-05-31    0.896206
2000-06-30    0.878092
Freq: M, dtype: float64

In [91]:
ts.shift(1, freq='90min')

2000-01-31 01:30:00    0.139218
2000-02-29 01:30:00    0.699511
2000-03-31 01:30:00    0.896206
2000-04-30 01:30:00    0.878092
dtype: float64
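A common use of `shift` without a frequency is computing period-over-period changes, because the shifted series lines up each value with its predecessor. The prices below are made-up numbers chosen so the result is easy to check by hand.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9],
                   index=pd.date_range('2000-01-01', periods=4, freq='D'))

# Each value divided by the previous one; the first entry has no predecessor
change = prices / prices.shift(1) - 1

print(change)
```

The first entry is NaN, which is the "lost" data point mentioned above for shifts that keep the index unchanged.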

#### Using offsets with datetime and Timestamp objects

In [93]:
from pandas.tseries.offsets import Day, MonthEnd
now = datetime(2011, 11, 7)
now + 3 * Day()

Timestamp('2011-11-10 00:00:00')

In [96]:
# Anchored offset: the first increment rolls the date forward to the next date that satisfies the frequency rule
now + MonthEnd()

Timestamp('2011-11-30 00:00:00')

In [97]:
offset = MonthEnd()
# The rollforward and rollback methods of an anchored offset explicitly roll a date forward or backward
print(offset.rollforward(now))
print(offset.rollback(now))

2011-11-30 00:00:00
2011-10-31 00:00:00
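The difference between adding an anchored offset and calling `rollforward` shows up when the date already sits on the anchor. A minimal sketch with a date that is already a month end:

```python
from datetime import datetime
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
on_anchor = datetime(2011, 11, 30)

rolled = offset.rollforward(on_anchor)  # already a month end, so unchanged
added = on_anchor + offset              # addition always moves to the next month end

print(rolled)
print(added)
```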


In [99]:
offset.rollforward

<function MonthEnd.rollforward>

In [101]:
ts = pd.Series(np.random.randn(20),
               index=pd.date_range('1/15/2000', periods=20, freq='4d'))
# groupby accepts the rollforward/rollback method of an anchored offset object as the grouping function
ts.groupby(offset.rollforward).mean()

2000-01-31    0.547646
2000-02-29   -0.311816
2000-03-31   -0.742592
dtype: float64
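Grouping by `rollforward` labels each group with its month-end date. As a hedged alternative (not from the notes above), the same monthly means can be obtained by grouping on monthly periods via `to_period('M')`. The sketch uses sequential values instead of random ones so both results can be compared.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

demo = pd.Series(np.arange(20, dtype=float),
                 index=pd.date_range('2000-01-15', periods=20, freq='4D'))

# Group label is the month-end Timestamp each date rolls forward to
by_rollforward = demo.groupby(MonthEnd().rollforward).mean()
# Group label is the monthly Period each date falls into
by_period = demo.groupby(demo.index.to_period('M')).mean()

print(by_rollforward)
print(by_period)
```

The values match; only the index labels differ (timestamps in one case, periods in the other).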