# Time series analysis

## the dateutil package

The most commonly used package is the built-in datetime library, which requires specifying the date format explicitly. Another package, dateutil, can detect the format automatically, and pandas also uses it behind the scenes in its pd.to_datetime function.

In [1]:
from dateutil.parser import parse

In [8]:
auto_detected_date1 = parse('2010-01-01')
auto_detected_date2 = parse('20100305')

In [9]:
print(auto_detected_date1)
print(auto_detected_date2)

2010-01-01 00:00:00
2010-03-05 00:00:00
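For contrast, here is a minimal sketch of the same two strings parsed with the built-in datetime library. The format strings are supplied by hand for illustration; the variable names are not part of the original notebook.

```python
from datetime import datetime
from dateutil.parser import parse

# datetime.strptime needs an explicit format string for each layout
explicit_1 = datetime.strptime('2010-01-01', '%Y-%m-%d')
explicit_2 = datetime.strptime('20100305', '%Y%m%d')

# dateutil works out the layout on its own
inferred_1 = parse('2010-01-01')
inferred_2 = parse('20100305')

# both routes produce the same datetime objects
print(explicit_1 == inferred_1, explicit_2 == inferred_2)
```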


The only caveat is that, following the US convention, dateutil assumes the month comes first and the day second when a string is ambiguous. Many other countries write the day first, so to read a date that way we pass dayfirst=True.

In [12]:
auto_detected_date2 = parse('03/05/2012', dayfirst=True)
auto_detected_date2

datetime.datetime(2012, 5, 3, 0, 0)
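To make the difference concrete, the sketch below parses the same ambiguous string both ways, and adds one string where the first number cannot be a month. The variable names are illustrative only.

```python
from datetime import datetime
from dateutil.parser import parse

# default: month first, so this is read as March 5
month_first = parse('03/05/2012')

# dayfirst=True: the same string is read as 3 May
day_first = parse('03/05/2012', dayfirst=True)

# 25 cannot be a month, so dateutil treats it as the day even by default
unambiguous = parse('25/12/2012')

print(month_first, day_first, unambiguous)
```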

## Indexing by timestamp

Indexing is more versatile when timestamps are used as the index. For instance, we can specify only a year and retrieve all the days in that year. We can also pass in a string, and pandas converts it to timestamp objects under the hood.

In [18]:
import pandas as pd 
import numpy as np 
two_years_data = pd.Series(np.random.randn(365*2), index=pd.date_range('2013-01-01', '2014-12-31', freq='D'))
choose_one_year = two_years_data['2014'].count()
choose_one_month = two_years_data['2014-01'].count()
print(choose_one_year, choose_one_month)

365 31
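The same partial-string lookups can also be written with .loc, and a pair of strings can be used as a slice. This is a small sketch using deterministic data instead of random numbers so the counts are easy to check; note that both ends of a label slice are included.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2014-12-31', freq='D')
two_years_data = pd.Series(np.arange(len(idx)), index=idx)

one_year = two_years_data.loc['2014']        # every day in 2014
one_month = two_years_data.loc['2014-01']    # every day in January 2014
# a slice between two date strings; both end points are included
span = two_years_data.loc['2013-12-25':'2014-01-05']

print(len(one_year), len(one_month), len(span))
```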


truncate is another method for getting part of a series indexed by timestamps. To be clear, truncating before a date means the data before that date is discarded.

In [31]:
s = two_years_data.truncate(before='2014-01')
print(s.index[0])
print(s.index[1])

2014-01-01 00:00:00
2014-01-02 00:00:00
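truncate also takes an after argument, so the two can be combined to keep a window. A minimal sketch with full dates on both sides (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', '2014-12-31', freq='D')
s = pd.Series(np.arange(len(idx)), index=idx)

# keep only January 2014: drop everything before the 1st and after the 31st
window = s.truncate(before='2014-01-01', after='2014-01-31')

# the equivalent label slice; both end points are kept in either form
same_window = s.loc['2014-01-01':'2014-01-31']

print(window.index[0], window.index[-1], len(window))
```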


We can easily generate a range of dates to use as an index by specifying the beginning and the end. The default frequency is daily, and each element in the range is a timestamp object. Besides days, we can also choose the interval to be weekly (W-MON), business days (B), and many other options. We can even use a custom frequency such as '1h30min'.

In [104]:
a_series_of_dates = pd.date_range('2010-01-01', '2010-01-20')
type(a_series_of_dates[0])

pandas._libs.tslibs.timestamps.Timestamp

In [105]:
business_day = pd.date_range('2010-01-01', '2010-01-20', freq='B')
print(business_day)

DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14', '2010-01-15', '2010-01-18',
               '2010-01-19', '2010-01-20'],
              dtype='datetime64[ns]', freq='B')
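The other frequencies mentioned above can be tried the same way. The sketch below uses W-MON over the same date range, and a '1h30min' step with the periods argument in place of an end date.

```python
import pandas as pd

# every Monday between the two dates (2010-01-01 was a Friday)
mondays = pd.date_range('2010-01-01', '2010-01-20', freq='W-MON')

# four timestamps spaced 90 minutes apart, starting at midnight
ninety_minutes = pd.date_range('2010-01-01', periods=4, freq='1h30min')

print(mondays)
print(ninety_minutes)
```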


## Shifting the series

This is useful for calculating the day-by-day percentage change. The only argument we need to pass is the number of time units to shift by. If the number is negative, the series is shifted in the opposite direction.

In [110]:
ten_days_data = pd.Series(np.random.randn(10), index=pd.date_range('2013-01-01', '2013-01-10', freq='D'))
ten_days_data

2013-01-01   -0.448277
2013-01-02   -1.246988
2013-01-03   -0.095033
2013-01-04   -0.456230
2013-01-05   -1.113853
2013-01-06   -0.513443
2013-01-07   -0.203354
2013-01-08   -2.141995
2013-01-09   -1.059850
2013-01-10    0.294845
Freq: D, dtype: float64

In [115]:
ten_days_data.shift(2)

2013-01-01         NaN
2013-01-02         NaN
2013-01-03   -0.448277
2013-01-04   -1.246988
2013-01-05   -0.095033
2013-01-06   -0.456230
2013-01-07   -1.113853
2013-01-08   -0.513443
2013-01-09   -0.203354
2013-01-10   -2.141995
Freq: D, dtype: float64

In [116]:
ten_days_data.shift(2, freq='D')

2013-01-03   -0.448277
2013-01-04   -1.246988
2013-01-05   -0.095033
2013-01-06   -0.456230
2013-01-07   -1.113853
2013-01-08   -0.513443
2013-01-09   -0.203354
2013-01-10   -2.141995
2013-01-11   -1.059850
2013-01-12    0.294845
Freq: D, dtype: float64

If no frequency is specified, a naive shift is performed, which means the index is not modified. In other words, the first two values become NaN and the last two observations are dropped. However, if we specify the frequency, the timestamps in the index are advanced instead and no data is lost. In this case, the last two observations are assigned to the 11th and 12th of January.
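The day-by-day percentage change mentioned at the start of this section can be sketched with a naive shift of one day. The prices below are made up for illustration; pandas also offers pct_change, which should give the same result.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9],
                   index=pd.date_range('2013-01-01', periods=4, freq='D'))

# divide each value by the previous day's value; the first day has no
# predecessor, so its result is NaN
manual_change = prices / prices.shift(1) - 1

# built-in equivalent
builtin_change = prices.pct_change()

# a negative shift moves values the other way: each day sees the next day's value
next_day = prices.shift(-1)

print(manual_change)
```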

To summarize, there are a few ways to construct a datetime index.
1. The first way is to construct it with pd.date_range(), as discussed above.
2. pd.DatetimeIndex can also accept a list of strings, parse them automatically into timestamps, and construct a datetime index from them.
3. pd.to_datetime works much like pd.DatetimeIndex, and its result can also be used as an index.

In [118]:
list_of_strings = ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
 '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08','2013-01-09', '2013-01-10']

In [128]:
from_strings_1 = pd.DatetimeIndex(list_of_strings, freq='D')
from_strings_2 = pd.to_datetime(list_of_strings) # freq='D' not allowed here but can be set later
from_strings_2.freq='D'

In [130]:
(from_strings_1 == from_strings_2).all()

True

We can also construct a DatetimeIndex from datetime objects. However, this is not terribly useful because pandas can convert from strings anyway.

In [132]:
from datetime import datetime
now = datetime.now()
pd.DatetimeIndex([now])

DatetimeIndex(['2018-12-24 21:40:48.022835'], dtype='datetime64[ns]', freq=None)

A different kind of index is the period index, which, unlike a timestamp, defines a span of time. One way to see this is that a Period object has a start_time attribute (a timestamp) and an end_time attribute (another timestamp).

Another thing to notice: pandas has both a Period and a PeriodIndex object, but for timestamps there is a DatetimeIndex and no pandas object called Datetime. The scalar counterpart is pd.Timestamp, which subclasses Python's datetime.datetime. My speculation is that the pandas developers skipped the name because a datetime class already exists in the built-in Python library.

Note that freq is optional for a Period object built from a date string, since pandas can infer it, but it has to be given when a PeriodIndex is built from a list of strings.

In [158]:
p = pd.Period(2007, freq='A-DEC')
print(p, p.start_time, p.end_time)
print(p + 1) #add and subtract, obviously

2007 2007-01-01 00:00:00 2007-12-31 23:59:59.999999999
2008


Period('2007-12-31', 'D')

In [141]:
p

Period('2007', 'A-DEC')
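To illustrate the PeriodIndex side, here is a minimal sketch that builds a monthly period index two ways and converts it back to timestamps. The monthly frequency and the variable names are chosen for illustration only.

```python
import pandas as pd

# three consecutive monthly periods
months = pd.period_range('2013-01', '2013-03', freq='M')

# the same index built from strings, with the frequency given explicitly
months_from_strings = pd.PeriodIndex(['2013-01', '2013-02', '2013-03'], freq='M')

# each element is a Period covering the whole month
first = months[0]
print(first.start_time, first.end_time)

# convert each period to the timestamp at its start
as_timestamps = months.to_timestamp()
print(as_timestamps)
```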

p